
Embedding Models & Retrieval

I have explored embedding models like instructor-large, as well as models from the simple-transformers library. You can compare embedding models on the Massive Text Embedding Benchmark (MTEB) Leaderboard. If even the best embedding models are unsatisfactory, there are some tricks to improve the quality of the retrieved text, but they require more compute:

  • Retrieve more text extracts, and re-rank them.
    • For the re-ranker, bge-reranker-large has been suggested.
  • Use windows of different lengths when embedding (for example, lengths of 1000 and 500, possibly with different models), so you are effectively embedding your document multiple times.
  • In your retrieval pipeline, use an LLM to extract only the relevant part (you can then re-embed the extracted text and re-rank it).
  • Try the HyDE approach
    • HyDE creates a "Hypothetical" answer with the help of an LLM and then searches the embeddings for a match. Here we are doing answer-to-answer embedding similarity search, as opposed to the query-to-answer embedding similarity search of the traditional RAG retrieval approach (a minimal sketch follows the figure below).

HyDE
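
Here is a minimal sketch of the HyDE step, assuming an OpenAI-style chat client for the hypothetical answer and a sentence-transformers bi-encoder for the embeddings; the model names and the `vector_index.search()` call are placeholders for whatever LLM and vector store you actually use.

```python
# Minimal HyDE sketch: embed an LLM-generated "hypothetical" answer instead of
# the raw query, then search the document embeddings with it.
# Assumptions: an OpenAI-style client, a sentence-transformers bi-encoder, and a
# hypothetical vector_index.search() wrapper around your vector store.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # any bi-encoder works

def hyde_retrieve(query: str, vector_index, top_k: int = 10):
    # 1. Ask the LLM for a plausible (possibly wrong) answer to the query.
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer rather than the query itself.
    vec = embedder.encode(hypothetical, normalize_embeddings=True)

    # 3. Answer-to-answer similarity search against the stored document embeddings.
    return vector_index.search(vec, top_k=top_k)
```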

  • Implement RAG-fusion (really powerful...)
    • It generates multiple queries and uses Reciprocal Rank Fusion to re-rank the combined search results.
  • Store two versions of every chunk: one for generating the embeddings, where you remove every stopword, apply lemmatization, etc., and another with the original text, which will be sent to the LLM as context.
  • Add "hybrid search" results in the reranking or RAG-fusion step, using BM25 or similar classic lexical algorithms (see the Reciprocal Rank Fusion sketch after this list).

Choosing an Embedding Model

It is better to use Encoder-Decoder Models (Sequence-to-Sequence Models, e.g. T5) or Encoder-only Models (Autoencoding Models, e.g. BERT, RoBERTa...) than Decoder-only Models (Autoregressive Models) like Llama or Mistral. It is possible to extract embeddings from a decoder-only model, but it does not offer a good trade-off.

If you use a bigger model, it takes more compute and therefore more time to go through all your data. Encoder-only models are also able to do bidirectional encoding (they encode all the information in the sentence, not just the meaning needed to generate the next word) and should be more accurate.

Really good models based on T5, in my opinion, are instructor-xl (3B parameters, with "instruction following capabilities" while generating embeddings) and bge-xxl (11B). Other smaller models (around 1B parameters) that perform similarly to (or even better than) those xl models are bge-large-v1.5 (as a bi-encoder) and bge-reranker-large (as a cross-encoder).
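
As a concrete sketch of how the two smaller models above are typically used (assuming the sentence-transformers library; the texts and queries are placeholders):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

# Bi-encoder: embeds each text independently, ahead of query time.
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
doc_embeddings = bi_encoder.encode(
    ["first document chunk", "second document chunk"],
    normalize_embeddings=True,
)

# Cross-encoder: scores (query, document) pairs jointly, at query time.
reranker = CrossEncoder("BAAI/bge-reranker-large")
scores = reranker.predict([
    ("what is a reranker?", "first document chunk"),
    ("what is a reranker?", "second document chunk"),
])
```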

Re-Rankers

We cannot just return loads of documents to fill up the LLM context (context stuffing) because this reduces the LLM's recall performance - note that this is the LLM's recall, which is different from retrieval recall, defined as:

\text{Recall@K} = \frac{\text{no. of relevant docs returned}}{\text{no. of relevant docs in the dataset}}
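
For reference, a tiny helper matching the formula (the argument names are mine):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)
```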

When storing information in the middle of a context window, an LLM's ability to recall that information becomes worse than had it not been provided in the first place (Lost in the Middle: How Language Models Use Long Contexts, 2023).

The solution to this issue is to maximize retrieval recall by retrieving plenty of documents and then maximize LLM recall by minimizing the number of documents that make it to the LLM. To do that, we reorder retrieved documents and keep just the most relevant for our LLM - to do that, we use reranking.

A reranking model - also known as a cross-encoder - is a type of model that, given a query and document pair, outputs a similarity score. We use this score to reorder the documents by relevance to our query, as in the two-stage retrieval system shown below.

A two-stage retrieval system. The vector DB step will typically include a bi-encoder or sparse embedding model.

Search engineers have used rerankers in two-stage retrieval systems for a long time. In these two-stage systems, a first-stage model (an embedding model/retriever) retrieves a set of relevant documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.

We use two stages because retrieving a small set of documents from a large dataset is much faster than reranking a large set of documents - we'll discuss why this is the case soon - but TL;DR, rerankers are slow, and retrievers are fast.
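Structurally, a two-stage pipeline is just the composition of the two steps. In the sketch below, `vector_search` and `reranker` are hypothetical stand-ins for your vector DB client and cross-encoder:

```python
def two_stage_retrieve(query, vector_search, reranker,
                       first_stage_k=50, final_k=5):
    # Stage 1: fast, approximate - fetch a generous candidate set from the vector DB.
    candidates = vector_search(query, top_k=first_stage_k)

    # Stage 2: slow, accurate - score every (query, document) pair with the reranker.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]
```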

Why Rerankers?

If a reranker is so much slower, why bother using one? The answer is that rerankers are much more accurate than embedding models.

The intuition behind a bi-encoder's inferior accuracy is that bi-encoders must compress all of the possible meanings of a document into a single vector - meaning we lose information. Additionally, bi-encoders have no context on the query because we don't know the query until we receive it (we create embeddings before user query time).

On the other hand, a reranker can receive the raw information directly into the large transformer computation, meaning less information loss. Because we are running the reranker at user query time, we have the added benefit of analyzing our document's meaning specific to the user query - rather than trying to produce a generic, averaged meaning.

Rerankers avoid the information loss of bi-encoders - but they come with a different penalty: time. A bi-encoder model compresses the document or query meaning into a single vector. Note that the bi-encoder processes our query in the same way as it does documents, but at user query time.

bi-encoder model

When using bi-encoder models with vector search, we frontload all of the heavy transformer computation to when we are creating the initial vectors - that means that when a user queries our system, we have already created the vectors, so all we need to do is:

  1. Run a single transformer computation to create the query vector.
  2. Compare the query vector to document vectors with cosine similarity (or another lightweight metric).
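
In code, the query-time work looks roughly like this (assuming normalized document embeddings were computed and saved earlier; the file and model names are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # placeholder bi-encoder
doc_vectors = np.load("doc_vectors.npy")  # precomputed, L2-normalized document embeddings

# 1. One transformer forward pass for the query.
query_vec = bi_encoder.encode("what is a reranker?", normalize_embeddings=True)

# 2. Cheap vector math against all precomputed document vectors.
similarities = doc_vectors @ query_vec        # cosine similarity via dot product
top_ids = np.argsort(-similarities)[:10]      # indices of the 10 best matches
```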

With rerankers, we are not pre-computing anything. Instead, we feed our query and a single other document into the transformer, run a whole transformer inference step, and output a single similarity score.

A reranker considers query and document to produce a single similarity score over a full transformer inference step. Note that document A below is equivalent to our query

Given 40M records, if we use a small reranking model like BERT on a V100 GPU, we'd be waiting more than 50 hours to return a single query result. We can do the same in well under 100 ms with encoder models and vector search.
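
The 50-hour figure is roughly consistent with a few milliseconds per (query, document) cross-encoder inference; the 4.5 ms used below is an assumed ballpark, not a measured benchmark:

```python
n_docs = 40_000_000
ms_per_inference = 4.5                        # assumption: ~4-5 ms per pair on a V100
total_hours = n_docs * ms_per_inference / 1000 / 3600
print(f"{total_hours:.0f} hours")             # -> 50 hours for a single query
```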


Resources: