
Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a framework for building LLM-powered systems that make use of external data sources and applications to overcome some of the limitations of these models. With RAG, we perform a semantic search across many text documents, which could range from tens of thousands to tens of billions of documents.

For example, to overcome the knowledge cutoff, we could retrain the model on new data, but this would quickly become very expensive and require repeated retraining to regularly update the model with new knowledge. A more flexible and less expensive way to overcome cutoffs is to give the model access to additional external data at inference time. We can achieve this with RAG. Having access to external data can also help mitigate the problem of hallucination.

RAG is useful in any case where we want our LLM to have access to data it may not have seen during training, such as new information, documents not included in the training data, or proprietary knowledge stored in our organization's private databases. This can improve the relevance and accuracy of our model's completions.

RAG enables you to use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more.

RAG has become the standard architecture for providing LLMs with context in order to avoid hallucinations. However, even RAG systems can suffer from hallucination, often when the retrieval step fails to return sufficient context, or returns irrelevant context that is then woven into the LLM's response.

Implementation

RAG is a framework and there are a number of different implementations available. The implementation discussed here is based on Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Facebook AI, 2020). The retriever is what drives RAG.

rag-retriever

It consists of a query encoder and an external data source. The query encoder takes the input prompt and encodes it in a form that can be used to query the external data source. In the paper, this data source is a vector store, but it could also be an SQL database, CSV files, or another data storage format.

These two components are trained together to find documents within the external data that are most relevant to the input query.

rag-training

The retriever returns the single best document, or a group of documents, from the data source and combines the new information with the original input prompt. The expanded prompt is then passed to the LLM, which generates a completion that makes use of the retrieved data.
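
To make the flow concrete, here is a minimal sketch of the retrieve-then-augment step. A toy keyword retriever stands in for a real query encoder and vector search; the function names and example corpus are illustrative, not a specific library's API.

```python
# Minimal sketch of retrieve-then-augment. The retriever below is a toy keyword
# matcher standing in for a real query encoder + vector search.

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k corpus entries that share the most words with the query."""
    scored = sorted(corpus, key=lambda doc: -sum(w in doc.lower() for w in query.lower().split()))
    return scored[:k]

def build_augmented_prompt(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks with the original question."""
    context = "\n\n".join(chunks)
    return f"Use the context below to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "CASE NUMBER: 22-48710BI-SME. Busy Industries (Plaintiff) vs State of Maine (Defendant)",
    "CASE NUMBER: 19-11223CV-NYC. Acme Corp (Plaintiff) vs Widget LLC (Defendant)",
]
question = "Who is the plaintiff in case 22-48710BI-SME?"
prompt = build_augmented_prompt(question, retrieve(question, corpus))
print(prompt)  # this expanded prompt is what would be sent to the LLM
```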

Example: Searching Legal Documents

Consider a lawyer who is using an LLM to help them in the discovery phase of a case. RAG can help them ask questions related to a corpus of documents such as previous court filings. Suppose they ask the model: "Who is the plaintiff in case 22-48710BI-SME?"

rag-retriever-legal-docs

This is passed to the query encoder, which encodes the prompt in the same format as the external documents. It then searches for a relevant entry in the corpus of documents. Having found a piece of text that contains the requested information, the retriever combines that text with the original prompt:

UNITED STATES DISTRICT COURT

SOUTHERN DISTRICT OF MAINE

CASE NUMBER: 22-48710BI-SME

Busy Industries (Plaintiff) vs State of Maine (Defendant)

Who is the plaintiff in case 22-48710BI-SME?

This expanded prompt is passed to the LLM. The model uses the information in the context of the prompt to generate a completion that contains the correct answer:

Busy Industries

While the example is quite simple and only returns a single piece of information that could be found by other means, RAG can also be used to generate summaries of filings or identify specific people, places and organizations within the full corpus of legal documents.

Possible Integrations

RAG can be used to integrate multiple types of external information sources. We can augment the LLM with access to:

  • Local documents, including private wikis and expert systems.
  • The internet, to extract information posted on web pages such as Wikipedia.
  • Databases, by training the query encoder to encode prompts into SQL queries.
  • Vector stores, which contain vector representations of text. These enable a fast and efficient kind of relevance search based on similarity, and they are particularly useful for LLMs since LLMs internally work with vector representations of language to generate text.

Considerations in Implementation

Implementing RAG is a little more complicated than simply adding text into the LLM. There are a couple of key considerations.

  • Context Window Size - Most text sources are too long to fit in the limited context window of the model, which may be only a few thousand tokens. Instead, the external data sources are split into many chunks, each of which fits in the context window. Packages like LangChain can do this automatically (see the sketch after this list).
  • Data Format - The external data should be available in a format that allows for easy retrieval of the most relevant text.
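
As an illustration of the chunking step, here is a small sketch using LangChain's RecursiveCharacterTextSplitter. The chunk sizes are arbitrary, and the import path may differ between LangChain versions (older releases use `from langchain.text_splitter import ...`).

```python
# Sketch of chunking a long document so each piece fits the context window budget.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk (arbitrary example value)
    chunk_overlap=50,  # overlap so sentences split across chunk boundaries are not lost
)

# Hypothetical long document; in practice this would be loaded from a file or loader.
long_text = "Busy Industries filed suit against the State of Maine in 2022. " * 100
chunks = splitter.split_text(long_text)
print(f"{len(chunks)} chunks; first chunk starts with: {chunks[0][:80]}")
```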

RAG Stages and Components

In RAG, your data is loaded and prepared for queries or "indexed". User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response. Even if what you're building is a chatbot or an agent, you'll want to know RAG techniques for getting data into your application.

RAG

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are shown here:

RAG Stages

Overall, in a simpler depiction:

RAG Stages

Loading

This refers to getting your data from where it lives, whether it's text files, PDFs, another website, a database, or an API, into your pipeline. LlamaHub provides hundreds of connectors to choose from.

Indexing & Embedding

This means creating a data structure that allows you to query the data. For LLMs this nearly always means creating vector embeddings: numerical representations of the meaning of your data. It also covers other metadata strategies that make it easy to accurately find contextually relevant data, typically by comparing each chunk's proximity to a query vector using a similarity metric like cosine similarity.
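
A minimal sketch of this indexing step, assuming the sentence-transformers library and an example embedding model (any sentence-embedding model could be substituted):

```python
# Sketch of indexing: embed each chunk and keep the vectors together for later querying.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; 384-dimensional embeddings

chunks = [
    "Busy Industries filed suit against the State of Maine in 2022.",
    "The court granted a motion to extend discovery by 60 days.",
    "Defendant's counsel requested a change of venue.",
]

# One embedding per chunk; normalising lets a plain dot product act as cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this model
```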


Improving Recall

For vector search to work, we need vectors. These vectors are essentially compressions of the "meaning" behind some text into (typically) 768 or 1536-dimensional vectors. There is some information loss because we're compressing this information into a single vector.

Because of this information loss, the top three (for example) results of a vector search will often miss relevant information; relevant documents may sit below our top_k cutoff and never be returned.

What do we do if relevant information at a lower position would help our LLM formulate a better response? The easiest approach is to increase the number of documents we're returning (increase top_k) and pass them all to the LLM.

The metric we would measure here is recall, meaning "how many of the relevant documents are we retrieving". Recall does not consider the total number of retrieved documents, so we could hack the metric and get perfect recall by returning everything.

\text{Recall@K} = \frac{\text{No. relevant docs returned}}{\text{No. relevant docs in the dataset}}
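
For example, Recall@K can be computed directly from a ranked list of retrieved document IDs and a set of ground-truth relevant IDs (the IDs below are made up):

```python
# Recall@K as defined above: the fraction of all relevant documents that appear in the top K results.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    returned_relevant = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return returned_relevant / len(relevant_ids)

retrieved = ["d7", "d2", "d9", "d4", "d1"]   # ranked retrieval output (hypothetical IDs)
relevant = {"d2", "d4", "d8"}                # all relevant docs in the dataset
print(recall_at_k(retrieved, relevant, k=3))  # 0.33 - only d2 is in the top 3
print(recall_at_k(retrieved, relevant, k=5))  # 0.67 - d2 and d4 are in the top 5
```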

Unfortunately, we cannot return everything. LLMs limit how much text we can pass to them, i.e. the context window. Even within that limit we cannot rely on context stuffing, because it reduces the LLM's recall performance. Note that this is LLM recall, which is different from the retrieval recall we have been discussing so far.

When information is stored in the middle of a context window, an LLM's ability to recall it becomes worse than if it had not been provided in the first place (Lost in the Middle: How Language Models Use Long Contexts, 2023).

We can increase the number of documents returned by our vector DB to increase retrieval recall, but we cannot pass these to our LLM without damaging LLM recall.

The solution to this issue is to maximize retrieval recall by retrieving plenty of documents, and then maximize LLM recall by minimizing the number of documents that make it to the LLM. We do this by reordering the retrieved documents and keeping only the most relevant ones for the LLM, a step known as reranking.
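
A hedged sketch of this retrieve-many-then-rerank pattern, using a cross-encoder from sentence-transformers as an example reranker; the model name and candidate documents are illustrative, and other rerankers would work equally well.

```python
# Fetch a large top_k from the vector store, rescore each (query, document) pair with a
# cross-encoder, and keep only the best few chunks for the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranker model

query = "Who is the plaintiff in case 22-48710BI-SME?"
candidates = [  # e.g. the top 25 documents returned by the vector store (shortened here)
    "Busy Industries (Plaintiff) vs State of Maine (Defendant), case 22-48710BI-SME.",
    "The court schedule for the Southern District of Maine was updated in March.",
    "Filing fees for civil cases were revised in 2022.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)
top_for_llm = [doc for _, doc in reranked[:2]]  # only the most relevant chunks reach the LLM
print(top_for_llm)
```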

Storing

Once your data is indexed, you will almost always want to store your index, as well as other metadata, to avoid having to re-index it. This is usually done in a vector store such as Pinecone. We discuss vector stores in more detail later.

Querying

A remarkable property of vector embeddings is that if you convert your question into a vector, it will end up near the data that is relevant to that question. This is called semantic locality.

Semantic Locality

These chunks, which are the most significant bits of context, can then be sent along with the question to the LLM.
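
Continuing the indexing sketch above, querying amounts to embedding the question and taking the nearest chunks by cosine similarity (this reuses the `model`, `chunks`, and `embeddings` names from that sketch):

```python
# Sketch of the querying step: embed the question and pull the nearest chunks.
import numpy as np

question = "Who sued the State of Maine?"
query_vec = model.encode([question], normalize_embeddings=True)[0]

similarities = embeddings @ query_vec          # cosine similarity, since vectors are normalised
top_k = np.argsort(similarities)[::-1][:2]     # indices of the 2 most similar chunks
for i in top_k:
    print(f"{similarities[i]:.3f}  {chunks[i]}")  # these chunks go to the LLM with the question
```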

Evaluation

A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.

Important Concepts

These are important parts of the Querying stage:

  • Retrievers: A retriever defines how to efficiently retrieve relevant context from an index when given a query. Your retrieval strategy is key to the relevancy of the data retrieved and the efficiency with which it's done.
  • Routers: A router determines which retriever will be used to retrieve relevant context from the knowledge base. More specifically, the RouterRetriever class is responsible for selecting one or multiple candidate retrievers to execute a query. It uses a selector to choose the best option based on each candidate's metadata and the query.
  • Node Postprocessors: A node postprocessor takes in a set of retrieved nodes and applies transformations, filtering, or re-ranking logic to them.
  • Response Synthesizers: A response synthesizer generates a response from an LLM, using a user query and a given set of retrieved text chunks.

Putting it together, we have on a higher level:

  • Query Engines: A query engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query and returns a response, along with the reference context retrieved and passed to the LLM (see the sketch after this list).
  • Chat Engines: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth exchanges instead of a single question and answer).
  • Agents: An agent is an automated decision-maker powered by an LLM that interacts with the world via a set of tools. Agents can take an arbitrary number of steps to complete a given task, dynamically deciding on the best course of action rather than following pre-determined steps. This gives them additional flexibility to tackle more complex tasks.
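
Since this section follows LlamaIndex terminology, here is a sketch of a query engine and chat engine built over a local folder of documents. Import paths vary across LlamaIndex versions (this assumes the 0.10+ layout), and an embedding model and LLM must be configured, by default via an OpenAI API key; the "data" folder is a hypothetical path.

```python
# Sketch of an end-to-end query engine and chat engine with LlamaIndex.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # Loading
index = VectorStoreIndex.from_documents(documents)      # Indexing & Embedding (in-memory store)

query_engine = index.as_query_engine(similarity_top_k=3)  # Querying: single question and answer
response = query_engine.query("Who is the plaintiff in case 22-48710BI-SME?")
print(response)

# A chat engine wraps the same index for multi-turn conversation:
chat_engine = index.as_chat_engine()
print(chat_engine.chat("And who is the defendant?"))
```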

Vector Stores

We know that LLMs create vector representations of each token in an embedding space, which allow them to identify semantically related words through measures such as cosine similarity.

Thus, when using vector stores, we take each chunk of external data and pass it through an embedding model to create an embedding vector for each chunk. These vectors are then stored in the vector store, allowing for fast searching of datasets and efficient identification of semantically related text.

Vector databases are a particular implementation of a vector store in which each vector is also identified by a key. This allows, for instance, the text generated by RAG to include a citation for the document from which it was retrieved.

Examples: Chroma, Pinecone, Weaviate, Faiss, Qdrant. You can see a comparison of vector databases here.
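
As a sketch of the storing stage, here is how an index might be stored and queried with Chroma; the collection name, documents, and metadata are illustrative, and by default Chroma computes embeddings with its built-in model (a specific embedding function could be passed instead).

```python
# Sketch of storing chunks in a Chroma collection and querying it by text.
import chromadb

client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path=...) would persist the index
collection = client.create_collection(name="legal_docs")

collection.add(
    ids=["chunk-1", "chunk-2"],  # each vector is identified by a key, enabling citations
    documents=[
        "Busy Industries (Plaintiff) vs State of Maine (Defendant), case 22-48710BI-SME.",
        "The court schedule for the Southern District of Maine was updated in March.",
    ],
    metadatas=[{"source": "filing"}, {"source": "schedule"}],
)

results = collection.query(query_texts=["Who is the plaintiff in case 22-48710BI-SME?"], n_results=1)
print(results["documents"][0])
```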

The Multi-Modal RAG Stack

How does this generalise to the multimodal case? In fact, it's very similar. The only difference is that the data you index can be images as well as text. You can embed images and audio, but you can't use the same embedding model as for text. The embeddings are stored separately, but often in the same database.

Multimodal RAG Stack

At the retrieval stage, you can use the same query to retrieve both text and images, and then use the context to generate a response.

RAG

Evaluation

As noted earlier, evaluation provides objective measures of how accurate, faithful, and fast your application's responses to queries are.

Feedback Functions

Feedback functions, analogous to labeling functions, provide a programmatic method for generating evaluations on an application run. The TruLens implementation of feedback functions wraps a supported provider's model, such as a relevance model or a sentiment classifier, that is repurposed to provide evaluations. Often, for the most flexibility, this model can be another LLM.

It can be useful to think of the range of evaluations along two axes: scalable and meaningful.


  • Domain Expert (Ground Truth) Evaluations: In early development stages, we recommend starting with domain expert evaluations. These evaluations are often completed by the developers themselves and represent the core use cases your app is expected to complete. This allows you to deeply understand the performance of your app, but lacks scale.
  • User Feedback (Human) Evaluations: After you have completed early evaluations and have gained more confidence in your app, it is often useful to gather human feedback. This can often be in the form of binary (up/down) feedback provided by your users. This is slightly more scalable than ground truth evals, but struggles with variance and can still be expensive to collect.
  • Traditional NLP Evaluations: Next, it is common practice to try traditional NLP metrics for evaluations such as BLEU and ROUGE. While these evals are extremely scalable, they are often too syntactic and lack the ability to provide meaningful information on the performance of your app.
  • Medium Language Model Evaluations (like BERT) can be a sweet spot for LLM app evaluations at scale. This size of model is relatively cheap to run (scalable) and can also provide nuanced, meaningful feedback on your app. In some cases, these models need to be fine-tuned to provide the right feedback for your domain. TruLens provides a number of feedback functions out of the box that rely on this style of model such as groundedness NLI, sentiment, language match, moderation and more.
  • Large Language Model Evaluations can also provide meaningful and flexible feedback on LLM app performance. Often through simple prompting, LLM-based evaluations can provide meaningful evaluations that agree with humans at a very high rate. Additionally, they can be easily augmented with LLM-provided reasoning to justify high or low evaluation scores that are useful for debugging. Depending on the size and nature of the LLM though, these evaluations can be quite expensive at scale.

The RAG Triad

TruEra created the RAG triad to evaluate for hallucinations along each edge of the RAG architecture, shown below:

Rag Triad

The RAG triad is made up of three evaluations: context relevance, groundedness, and answer relevance. Satisfactory evaluations on each provide us confidence that our LLM app is free from hallucination.

Context Relevance

The first step of any RAG application is retrieval; to verify the quality of our retrieval, we want to make sure that each chunk of context is relevant to the input query. This is critical because this context will be used by the LLM to form an answer, so any irrelevant information in the context could be weaved into a hallucination. TruLens enables you to evaluate context relevance by using the structure of the serialized record.

Groundedness

After the context is retrieved, it is then formed into an answer by an LLM. LLMs are often prone to stray from the facts provided, exaggerating or expanding to a correct-sounding answer. To verify the groundedness of our application, we can separate the response into individual claims and independently search for evidence that supports each within the retrieved context.

Answer Relevance

Lastly, our response still needs to helpfully answer the original question. We can verify this by evaluating the relevance of the final response to the user input.
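
To make the triad concrete, here is a hedged sketch of the three checks expressed as simple LLM-as-judge prompts. The `llm` callable and the prompt wording are illustrative assumptions, not the TruLens API.

```python
# Sketch of the RAG triad as LLM-as-judge checks. `llm` is a hypothetical callable
# that takes a prompt string and returns the judge model's text response.
from typing import Callable

def context_relevance(llm: Callable[[str], str], query: str, chunk: str) -> str:
    # Is each retrieved chunk relevant to the user's question?
    return llm(f"On a scale of 0-10, how relevant is this context to the question?\n"
               f"Question: {query}\nContext: {chunk}\nAnswer with a single number.")

def groundedness(llm: Callable[[str], str], claim: str, context: str) -> str:
    # Is each claim in the response supported by the retrieved context?
    return llm(f"Does the context below support this claim? Answer yes or no.\n"
               f"Claim: {claim}\nContext: {context}")

def answer_relevance(llm: Callable[[str], str], query: str, answer: str) -> str:
    # Does the final response actually address the original question?
    return llm(f"On a scale of 0-10, how well does this answer address the question?\n"
               f"Question: {query}\nAnswer: {answer}\nAnswer with a single number.")
```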

Putting it together

By reaching satisfactory evaluations for this triad, we can make a nuanced statement about our application's correctness; our application is verified to be hallucination-free up to the limit of its knowledge base. In other words, if the vector database contains only accurate information, then the answers provided by the RAG are also accurate.

Alignment: Honest, Harmless and Helpful Evaluations

TruLens also adapts the 'honest, harmless, helpful' alignment usually seen alongside RLHF as desirable criteria for LLM apps. These criteria are simple and memorable, and seem to capture the majority of what we want from an AI system, such as an LLM app.

TruLens Implementation of HHH

To accomplish these evaluations TruLens has built out a suite of evaluations (feedback functions) that fall into each category, shown below. These feedback functions provide a starting point for ensuring your LLM app is performant and aligned.

TruLens Application

