Vector Databases

Vector databases are a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes, and that supports fast maximum inner-product search (MIPS). To optimize retrieval speed, the common choice is an approximate nearest neighbors (ANN) algorithm that returns roughly the top k nearest neighbors, trading a small loss in accuracy for a huge speedup.
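
As a point of reference, here is a minimal sketch of exact (brute-force) MIPS in numpy; `corpus` and `query` are illustrative names, not any library's API. ANN methods exist precisely to avoid this linear scan:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 768)).astype(np.float32)  # stored embeddings
query = rng.standard_normal(768).astype(np.float32)

# Exact MIPS: score every stored vector against the query, keep the top-k.
scores = corpus @ query                    # one inner product per stored vector
k = 10
top_k = np.argpartition(-scores, k)[:k]    # k largest scores, unordered
top_k = top_k[np.argsort(-scores[top_k])]  # sort those k by score
```

This exact scan costs O(n·d) per query; ANN indexes give up a little recall to avoid touching most of the corpus.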

ANN Algorithms

A few common choices of ANN algorithms for fast MIPS:

  • LSH (Locality-Sensitive Hashing): It introduces a hashing function such that similar input items are mapped to the same buckets with high probability, where the number of buckets is much smaller than the number of inputs (a minimal sketch follows this list).
  • ANNOY (Approximate Nearest Neighbors Oh Yeah): The core data structure is a set of random projection trees, binary trees where each non-leaf node represents a hyperplane splitting the input space in half and each leaf stores one data point. Trees are built independently and at random, so to some extent they mimic a hashing function. An ANNOY search runs in all the trees, iteratively searching through the half closest to the query, and then aggregates the results. The idea is closely related to KD-trees but a lot more scalable.
  • HNSW (Hierarchical Navigable Small World): It is inspired by the idea of small-world networks, where most nodes can be reached from any other node within a small number of steps; e.g. the "six degrees of separation" property of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layer contains the actual data points. The layers in the middle create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top layer and navigates towards the target. When it can't get any closer, it moves down to the next layer, until it reaches the bottom layer. Each move in the upper layers can potentially cover a large distance in the data space, and each move in the lower layers refines the search quality.
  • FAISS (Facebook AI Similarity Search): It operates on the assumption that in high-dimensional space, distances between nodes follow a Gaussian distribution and thus there should exist clusters of data points. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within clusters. Search first looks for cluster candidates with coarse quantization and then looks further into each cluster with finer quantization.
  • ScaNN (Scalable Nearest Neighbors): The main innovation in ScaNN is anisotropic vector quantization. It quantizes a data point $x_i$ to $x_i^*$ such that the inner product $\langle q, x_i^* \rangle$ stays as close to the original $\langle q, x_i \rangle$ as possible, instead of simply picking the closest quantization centroid.
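
To make the LSH idea above concrete, here is a minimal random-hyperplane (SimHash-style) sketch for cosine similarity in plain numpy; all names and parameters are illustrative, not any particular library's API:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_planes = 768, 16
planes = rng.standard_normal((n_planes, d))  # 16 random hyperplanes -> up to 2^16 buckets

def bucket(x: np.ndarray) -> int:
    # Record which side of each hyperplane x falls on; vectors at a small
    # angle share the same bit pattern (bucket) with high probability.
    bits = planes @ x > 0
    return int("".join("1" if b else "0" for b in bits), 2)

# Index time: hash every vector into its bucket.
vectors = rng.standard_normal((10_000, d))
buckets = defaultdict(list)
for i, v in enumerate(vectors):
    buckets[bucket(v)].append(i)

# Query time: only score candidates that share the query's bucket.
q = rng.standard_normal(d)
candidates = buckets[bucket(q)]
```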

Here is a comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020.) See ann-benchmarks.com for more MIPS algorithms and performance comparisons.

Vector Database Choices

Vector databases are specialized storage systems designed for efficient management of dense vectors and advanced similarity search, while vector libraries are integrated into an existing DBMS or search engine to enable similarity search within a broader database context. The choice between the two depends on the specific requirements and scale of the application.

  1. Elasticsearch: A distributed search and analytics engine that supports various types of data. One of the data types it supports is vector fields, which store dense vectors of numeric values. In version 7.10, Elasticsearch added support for indexing vectors into a specialized data structure to enable fast kNN retrieval through the kNN search API. In version 8.0, it added support for native natural language processing (NLP) with vector fields.
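
As a hedged sketch of what this looks like from the official Python client against an 8.x cluster (index and field names here are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A dense_vector field indexed for approximate kNN (Elasticsearch 8.x).
es.indices.create(
    index="docs",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

es.index(index="docs", document={"title": "hello", "embedding": [0.1] * 384})

# kNN search: num_candidates controls the accuracy/speed trade-off.
resp = es.search(
    index="docs",
    knn={"field": "embedding", "query_vector": [0.1] * 384, "k": 10, "num_candidates": 100},
)
```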

  2. Faiss: A library for efficient similarity search and clustering of dense vectors. It contains algorithms that search sets of vectors of any size, up to ones that may not fit in RAM. It also contains supporting code for evaluation and parameter tuning. It is developed primarily at Meta's Fundamental AI Research group.

    • Faiss is a library rather than a full database; a number of vector search products embed it or build on its algorithms.
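
A minimal Faiss sketch of the coarse-then-fine quantization search described in the ANN section above (all sizes and parameters are illustrative):

```python
import faiss
import numpy as np

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

# IVF partitions the space into nlist clusters via the coarse quantizer;
# product quantization (m=16 sub-vectors, 8 bits each) compresses residuals.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 16, 8)  # nlist=1024, m=16, nbits=8
index.train(xb)   # learn cluster centroids and PQ codebooks
index.add(xb)

index.nprobe = 32            # clusters to visit at query time
D, I = index.search(xq, 10)  # top-10 approximate neighbors per query
```
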
  3. Milvus: An open-source vector database that can manage massive (billion-scale and beyond) vector datasets and supports multiple vector search indexes and built-in filtering. It is a cloud-native vector database solution that can manage unstructured data. It supports automated horizontal scaling and uses acceleration methods to enable high-speed retrieval of vector data.

    • Milvus supports multiple ANN-based index types such as IVF_FLAT, Annoy, HNSW, and RNSG.
  4. Qdrant: A vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored to extended filtering support, which makes it useful for all sorts of neural-network or semantic matching, faceted search, and other applications.
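
A small sketch with the qdrant-client Python package showing the filtering-plus-vector-search combination (collection name, payload, and vector size are made up):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(":memory:")  # in-process mode; pass a URL for a real server

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"lang": "en"})],
)

# Payload filter and vector similarity in a single query.
hits = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    query_filter=Filter(must=[FieldCondition(key="lang", match=MatchValue(value="en"))]),
    limit=5,
)
```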

  5. Chroma: An AI-native open-source embedding database. It is simple, feature-rich, and integrable with various tools and platforms for working with embeddings. It provides a JavaScript client and a Python API for interacting with the database.

    • Claims to be the first AI-centric vector DB. Looks really promising, but from what I can tell there's no persistence available when self-hosting, meaning it's more like a service you spin up and load data into; when you kill the process, the data goes away.
    • What is most interesting to me about Chroma is its time-series functionality, which might make it appropriate for streaming real-time data events into over long time periods and running queries over the time series.
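
A minimal chromadb sketch (the default client is in-memory, which matches the persistence caveat above; collection name and documents are illustrative):

```python
import chromadb

client = chromadb.Client()  # in-memory by default

collection = client.create_collection("notes")
collection.add(
    ids=["a", "b"],
    documents=["vector databases store embeddings", "redis is a key-value store"],
    metadatas=[{"topic": "db"}, {"topic": "kv"}],
)

# Chroma embeds query_texts with its default embedding function.
results = collection.query(query_texts=["what stores embeddings?"], n_results=1)
print(results["ids"], results["distances"])
```
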
  6. OpenSearch: A community-driven, open-source fork of Elasticsearch and Kibana created after the license change in early 2021. It includes vector database functionality that lets you store and index vectors and metadata, and perform vector similarity search using k-NN indexes.

  7. Weaviate: An open-source vector database that allows you to store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.

  8. Vespa: A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference allows you to apply AI to make sense of your data in real time.

  9. pgvector: An open-source extension for PostgreSQL that allows you to store and query vector embeddings within your database. pgvector implements its own vector indexing inside Postgres (rather than wrapping an external library such as Faiss), is easy to use, and can be installed with a single command.

    • Basic Postgres extension: open source, free, ubiquitous, no frills.
    • Apparently doesn't benchmark very well.
    • pgvector is great for integrating with your relational metadata, but it's not as fast as the best-of-breed vector DBs.
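
A minimal sketch using psycopg against a Postgres instance that has the pgvector extension available (table and column names are made up):

```python
import psycopg

with psycopg.connect("dbname=mydb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
    )
    conn.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")

    # <-> is L2 distance; pgvector also provides <#> (negative inner product)
    # and <=> (cosine distance) operators.
    rows = conn.execute(
        "SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5"
    ).fetchall()
```
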
  10. Vald: A highly scalable, distributed, fast approximate nearest neighbor dense vector search engine. Vald is designed and implemented based on a cloud-native architecture. It uses NGT, one of the fastest ANN algorithms, to search for neighbors. Vald has automatic vector indexing, index backup, and horizontal scaling, which make it suited to searching across billions of feature vectors.

    • Vald uses a distributed index graph to support asynchronous indexing. It stores each index in multiple agents which enables index replicas and ensures high availability.
    • Vald is also open-source and free to use. It can be deployed on a Kubernetes cluster and the only cost incurred is that of the infrastructure.
  11. Apache Cassandra: An open-source NoSQL distributed database trusted by thousands of companies. Vector search is coming to Apache Cassandra in its 5.0 release, which is expected to be available in late 2023 or early 2024. This feature is based on a collaboration between DataStax and Google, who are working on integrating Apache Cassandra with Google's open-source vector search library, ScaNN.

  12. ScaNN (Scalable Nearest Neighbors, Google Research): A library for efficient vector similarity search, which finds the k nearest vectors to a query vector as measured by a similarity metric. Vector similarity search is useful for applications such as image search, natural language processing, recommender systems, and anomaly detection.
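
A sketch roughly following the builder pattern in the ScaNN README: tree partitioning, asymmetric hashing with the anisotropic quantization described earlier, then exact re-ranking. Treat the exact parameters as illustrative:

```python
import numpy as np
import scann

dataset = np.random.rand(10_000, 128).astype(np.float32)

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=10_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # anisotropic quantization
    .reorder(100)                                         # exact re-rank of top 100
    .build()
)

neighbors, distances = searcher.search(dataset[0], final_num_neighbors=10)
```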

  13. Pinecone: A vector database designed for machine learning applications. It is fast, scalable, and supports features such as metadata filtering and multiple similarity metrics for efficient search over dense vector embeddings.

    • Pinecone is a fully managed vector database.
    • It offers features like metadata filtering and distributed infrastructure, with reliability and speed as the key benefits.
    • It's hosted: the free plan only gives you one index, and paid plans are expensive.
    • Mentioning it here because it seems to be the de facto choice for most projects, so it's good to know about, but it ain't self-hosted.
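
For reference, a sketch using the classic pinecone-client package (newer client versions changed the initialization API; the key, environment, and index name are placeholders):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # hosted service

pinecone.create_index("demo", dimension=4, metric="cosine")
index = pinecone.Index("demo")

index.upsert(vectors=[("doc1", [0.1, 0.2, 0.3, 0.4], {"topic": "db"})])

# Metadata filtering combined with vector search in one call.
res = index.query(vector=[0.1, 0.2, 0.3, 0.4], top_k=3, filter={"topic": "db"})
```
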
  14. Marqo: Simple to use; comes with embedding and inference management, supports multi-modal data, handles chunking of text/images, and much more.

  15. Embeddinghub: An open-source solution designed to store machine learning embeddings with high durability and easy access. It allows intelligent analysis, like approximate nearest neighbor operations, and regular analysis, like partitioning and averaging. It uses the HNSW algorithm (via hnswlib) to index the embeddings, offering high-performance approximate nearest neighbor lookup.
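
Since the lookup layer is hnswlib, here is a minimal hnswlib sketch of the same kind of HNSW index (all parameters are illustrative):

```python
import hnswlib
import numpy as np

d, n = 128, 10_000
data = np.random.rand(n, d).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=n, ef_construction=200, M=16)  # graph build knobs
index.add_items(data, np.arange(n))

index.set_ef(50)  # query-time breadth: higher = better recall, slower
labels, distances = index.knn_query(data[:5], k=10)
```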

  16. Redis: Apparently you can use Redis, which is typically thought of as a key-value store and is very readily available. Redis is super popular in the Rails community. Probably a fine choice: it's free, open source, and fast as F (for key/value stuff, anyway). Here is a quick start.
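
Note that vector search needs the RediSearch module (e.g. Redis Stack), not bare Redis. A hedged sketch with redis-py; the index name, key prefix, and field names are made up:

```python
import numpy as np
import redis
from redis.commands.search.field import TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

dim = 128
r.ft("idx").create_index(
    (
        TagField("topic"),
        VectorField("embedding", "HNSW", {"TYPE": "FLOAT32", "DIM": dim, "DISTANCE_METRIC": "COSINE"}),
    ),
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

vec = np.random.rand(dim).astype(np.float32)
r.hset("doc:1", mapping={"topic": "news", "embedding": vec.tobytes()})

# KNN query: $vec is bound to the raw query-vector bytes at search time.
q = (
    Query("*=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("topic", "score")
    .dialect(2)
)
res = r.ft("idx").search(q, query_params={"vec": vec.tobytes()})
```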

Common Features

Vector databases and vector libraries are both technologies that enable vector similarity search, but they differ in functionality and usability:

  • Vector databases can store and update data, handle various types of data sources, perform queries during data import, and provide user-friendly and enterprise-ready features.
  • Vector libraries store only the vectors themselves, handle vectors exclusively, require importing all the data before building the index, and require more technical expertise and manual configuration.

Some vector databases are built on top of existing libraries, such as Faiss. This allows them to take advantage of the library's existing code and features, which can save time and effort in development.

These vector databases and libraries are used in artificial intelligence (AI) applications such as machine learning, natural language processing, RAG, and image recognition. They share some common features:

  1. They support vector similarity search, which finds the k nearest vectors to a query vector, as measured by a similarity metric. Vector similarity search is useful for applications such as image search, natural language processing, recommender systems, and anomaly detection.

  2. They use vector compression techniques to reduce the storage space and improve the query performance. Vector compression methods include scalar quantization, product quantization, and anisotropic vector quantization.

  3. They can perform exact or approximate nearest neighbor search, depending on the trade-off between accuracy and speed. Exact nearest neighbor search provides perfect recall, but may be slow for large datasets. Approximate nearest neighbor search uses specialized data structures and algorithms to speed up the search, but may sacrifice some recall.

  4. They support different types of similarity metrics, such as L2 distance, inner product, and cosine distance. Different similarity metrics may suit different use cases and data types (see the numeric sketch after this list).

  5. They can handle various types of data sources, such as text, images, audio, video, and more. Data sources can be transformed into vector embeddings using machine learning models, such as word embeddings, sentence embeddings, image embeddings, etc.
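
To pin down the metrics in item 4, a small numeric sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(768), rng.standard_normal(768)

l2 = float(np.linalg.norm(a - b))                                # smaller = closer
inner = float(a @ b)                                             # larger = closer
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # in [-1, 1]

# For unit-normalized vectors the three rankings coincide:
# ||a - b||^2 = 2 - 2 * <a, b>, and cosine similarity equals <a, b>.
```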

When choosing a vector database, it is important to consider your specific needs and requirements.

It is also important to note that vector databases are made specifically for working with vector embeddings (storing the vectors efficiently, but also searching and performing mathematical operations on them); they should not be used as the primary persistent store for your data.

