How to Build RAG Systems (5-Step Framework)
Shaw Talebi
Video Summary
This video outlines a systematic five-step approach to building robust Retrieval Augmented Generation (RAG) systems for real-world applications, moving beyond impressive demos to reliable deployments. The process begins with scoping a Minimum Viable Product (MVP) by defining users, use cases, and data sources, emphasizing the importance of starting small and specific. It then details creating a "golden dataset" of query-context pairs for ground truth evaluation, followed by building baseline retrieval and response systems. Crucially, the video highlights the necessity of rigorous end-to-end evaluation through error analysis and experimentation, advocating for methodical, single-variable changes to identify optimal design choices. A key insight is that a common pitfall is overscoping the initial RAG system, leading to stalled projects; instead, focusing on high-leverage use cases for a single user and data source can generate significant impact and serve as a foundation for future expansion.
Short Highlights
- RAG (Retrieval Augmented Generation) grounds AI responses in provided context, mitigating hallucinations and managing specific knowledge.
- A common pitfall is the transition from impressive RAG demos to disappointing real-world deployments due to a lack of systematic building.
- A 5-step framework is proposed: 1. Scope the MVP, 2. Create a Golden Dataset, 3. Build the Retrieval System, 4. Build the Answer System, 5. Run Experiments.
- The MVP scoping involves defining users, use cases, and data sources, advocating for a focused start rather than an all-encompassing system.
- Systematic evaluation through error analysis and experimentation is key to identifying and improving specific failure modes in RAG systems.
Key Details
Scoping the MVP [0:46]
- The initial step focuses on project management fundamentals, not technology, to avoid common pitfalls where excitement in AI leads to neglecting basic project discovery.
- Key considerations for scoping include identifying the target users (technical/non-technical, roles, teams) and understanding their specific use case (e.g., customer support, legal, insurance claims), including what the system will and will not do.
- A common pitfall is overscoping the initial version to serve all users, use cases, and data sources, which is resource-intensive and may not justify the investment.
- The strategy should be to focus on high-leverage use cases, serving a single user and use case with a few data sources, since a small share of use cases typically generates most of the impact.
- An example is building an answer engine for YouTube video transcripts targeting students of an AI cohort for technical Q&A, starting with one data source.
- Starting simple and specific allows for implementation within a reasonable timeframe (e.g., 6 weeks) and provides a clear understanding of next steps.
A pro tip here is to start simple and specific because if you overscope the MVP, one of two things is going to happen: one, you're not going to build a great system because you just don't have enough time to give each component of the system the attention that it needs. Or two, you don't have enough time to actually build the thing properly.
Creating a Golden Dataset [8:15]
- A "golden dataset" consists of pairs of queries and their relevant context, serving as ground truth to evaluate the retrieval system's performance.
- This dataset enables systematic, data-driven improvement of the RAG system by comparing retrieval results against the ground truth.
- The concept generalizes beyond question-answering to any RAG system accepting user input, such as matching insurance claims to existing ones.
- A crucial technical note is the importance of a development-testing split for the golden dataset to prevent overfitting and ensure generalization to new user inputs.
- The best method for creating a golden dataset is by using real-world queries from existing systems, then manually curating correct context and search results.
- When real-world queries are unavailable, synthetic query generation using an LLM is an option, starting with the inventory of source documents.
- Iterating on prompt design for synthetic query generation is critical, as low-quality queries will lead to a low-quality RAG system.
- Prompts for query generation should define axes like personas and use cases, breaking them down into specific combinations (e.g., manager/individual contributor, onboarding/policy lookup).
- Grounding synthetic generation in available real-world data, such as YouTube comments for a YouTube answer engine, can enhance query quality and diversity.
- Example query types include factual, conceptual, and procedural, with varying difficulty levels (grounded, medium, hard) to cover a wide range of user needs.
- Manual review of synthetic queries is essential, as they are not perfect; a smaller, high-quality dataset is preferable to a larger, low-quality one.
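To make the persona/use-case/difficulty idea concrete, here is a minimal sketch of synthetic query generation (not shown in the video). The persona, use-case, and prompt wording are illustrative placeholders drawn from the examples above, and `call_llm` and the `chunk` dict structure stand in for whatever LLM client and document store the project uses.

```python
import itertools
import json

# Axes for synthetic query generation, mirroring the prompt-design guidance above.
# The specific values are illustrative placeholders.
PERSONAS = ["manager", "individual contributor"]
USE_CASES = ["onboarding", "policy lookup"]
DIFFICULTIES = ["grounded", "medium", "hard"]

PROMPT_TEMPLATE = """You are generating evaluation queries for a RAG system.
Persona: {persona}
Use case: {use_case}
Difficulty: {difficulty}

Source document excerpt:
{chunk}

Write one realistic query this persona might ask that is answerable from the excerpt.
Return JSON: {{"query": "...", "query_type": "factual|conceptual|procedural"}}"""


def generate_synthetic_queries(chunks, call_llm):
    """Build query-context pairs by sweeping persona x use case x difficulty.

    `chunks` is assumed to be a list of dicts with "id" and "text" keys;
    `call_llm` is a placeholder for whatever LLM client the project uses
    (prompt string in, model text out).
    """
    golden_pairs = []
    for chunk in chunks:
        for persona, use_case, difficulty in itertools.product(PERSONAS, USE_CASES, DIFFICULTIES):
            prompt = PROMPT_TEMPLATE.format(
                persona=persona, use_case=use_case, difficulty=difficulty, chunk=chunk["text"]
            )
            result = json.loads(call_llm(prompt))
            golden_pairs.append(
                {
                    "query": result["query"],
                    "query_type": result["query_type"],
                    "relevant_chunk_ids": [chunk["id"]],  # ground-truth context for this query
                }
            )
    return golden_pairs  # review these manually before treating them as ground truth
```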
It's important that you create a development testing split for your golden data set. All that means is you're going to reserve some percentage of your golden data set for the development work.
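A minimal sketch of that split might look like the following; the 50/50 fraction is an arbitrary assumption, not a recommendation from the video.

```python
import random

def dev_test_split(golden_pairs, dev_fraction=0.5, seed=42):
    """Reserve part of the golden dataset for development; hold the rest out for testing."""
    rng = random.Random(seed)
    pairs = list(golden_pairs)
    rng.shuffle(pairs)
    cut = int(len(pairs) * dev_fraction)
    return pairs[:cut], pairs[cut:]  # (dev set, held-out test set)
```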
Building the Initial Retrieval System [15:22]
- This step focuses on creating a baseline retrieval system by processing source documents into a database, which can be vector-based, lexical, or a hybrid.
- Evaluating the retrieval system's success is paramount, not just building it. This involves defining retrieval evaluation metrics.
- Three popular metrics are discussed:
- Precision: The percentage of retrieved chunks that are relevant (TP / (TP + FP)).
- Recall at K: The percentage of all relevant chunks retrieved within the top K results (Relevant in Top K / Total Relevant). This metric is useful when ensuring the relevant chunks are present matters more than avoiding irrelevant ones.
- Mean Reciprocal Rank (MRR): An aggregate metric that averages, across queries, the reciprocal of the rank of the first relevant chunk.
- The choice of evaluation metric depends on the specific use case and priorities; for the YouTube answer engine example, recall was the primary evaluation metric.
- Aggregating these metrics across the entire dataset provides an overall sense of the retrieval system's performance.
Precision is just the percentage of retrieved chunks which are relevant. Recall is the percentage of all relevant chunks that are retrieved. And then finally MRR captures the performance of a set of queries based on ranking.
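As a reference, here is a straightforward sketch of these three metrics, assuming each query's retrieved and ground-truth relevant chunks are identified by IDs from the golden dataset.

```python
def precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant: TP / (TP + FP)."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for cid in relevant_ids if cid in top_k) / len(relevant_ids)


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant chunk, across all queries.

    `results` is a list of (retrieved_ids, relevant_ids) pairs, one per query.
    """
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in results:
        rr = 0.0
        for rank, cid in enumerate(retrieved_ids, start=1):
            if cid in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```

Aggregating precision and recall over the dev split of the golden dataset, alongside MRR, gives the baseline numbers that later experiments are compared against.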
Building the Initial Response System [19:58]
- Steps three and four (retrieval and response systems) are intentionally kept separate to allow for independent development and evaluation of these core RAG components.
- The response system involves taking a user request, using it to initiate retrieval, and then feeding the retrieved context and instructions to an LLM to generate a response.
- Two main approaches exist:
- Classic RAG: A linear workflow where the user query directly triggers retrieval, and results are passed to the LLM. This is simple but can lead to unnecessary retrievals for non-contextual queries or failed retrievals for unclear queries.
- Agentic RAG: Leverages LLM tool-calling capabilities, where the LLM decides when to use a retrieval tool. This simplifies query routing and rewriting but introduces potential failure modes where the agent retrieves when it shouldn't or vice-versa.
- Starting with Classic RAG is recommended for simplicity, with a transition to Agentic RAG being easier once the initial system is stable (see the sketch at the end of this section).
- End-to-end evaluation of the RAG system is more complex than retrieval evaluation, as generic NLP metrics don't capture the nuances of open-ended responses.
- Error analysis is the primary method for evaluating responses: manually reviewing model outputs for user inputs and leaving open-ended notes on issues.
- Patterns in errors emerge after reviewing a significant number of responses (e.g., 30-100). Common failures can be categorized with specialized tags (e.g., "bad framing," where the model responds as if the user had supplied the retrieved context).
- To facilitate this, custom data viewers can be built to streamline the review process.
- Automating evaluations can be done via code-based checks (for predictable patterns like specific phrases) or LLM-based checks (using an LLM judge), though LLM judges require careful alignment with manual labels.
- Code-based evaluations are simpler and more transparent, while LLM-based evaluations can handle more complex edge cases but are harder to align and understand.
While the LLM judge might sound like magic, there's of course a gotcha here, which is that the alignment of the LLM judge is not trivial.
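For orientation, here is a minimal sketch of the classic (linear) RAG workflow described above. `retriever.search` and `call_llm` are placeholders for whatever vector store and LLM client the project uses, and the system prompt wording is illustrative only.

```python
SYSTEM_PROMPT = """Answer the user's question using only the provided context.
If the context does not contain the answer, say so. Do not refer to "the context
you provided" -- the user never sees the retrieved chunks."""


def answer_query(query, retriever, call_llm, k=5):
    """Classic (linear) RAG: always retrieve first, then generate.

    `retriever.search(query, k)` and `call_llm(system, user)` are placeholders
    for whatever vector store and LLM client the project uses.
    """
    chunks = retriever.search(query, k=k)
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}"
    answer = call_llm(system=SYSTEM_PROMPT, user=user_prompt)
    return {"answer": answer, "retrieved_chunks": chunks}  # keep chunks for error analysis
```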
Running Experiments [30:00]
- This final step involves creating different versions of the RAG system and systematically evaluating them against the baseline.
- The project should be designed for experimentation, not immediate production, as optimal design choices are rarely known at the outset and are discovered through testing.
- Experiments should focus on changing one thing at a time to isolate the impact of each modification on evaluation metrics.
- The retrieval system is often the best starting point for improvements, as issues frequently lie within the source documents, extraction, pre-processing, or chunking.
- Other areas for experimentation include specialized indexes (by data modality or use case), retrieval parameters, embedding models, tagging chunks, hybrid search, and rerankers.
- Improvements can also be made to the query process (rewriting, agentic RAG) or the prompt sent to the LLM, or by using a more powerful LLM or query routing.
- The number of design choices is vast, making evaluations and error analysis crucial for guiding improvements based on specific failures.
- For instance, if "bad framing" is a common failure, prompt updates can be tested; if recall needs improvement, focus shifts to retrieval system aspects like pre-processing or chunking.
- Each experiment should aim to correct a specific failure, with a corresponding metric to track progress, ensuring that changes are objectively measured and their impact is clear.
When trying to make an improvement to your system, start with the failure that you're trying to correct and have a metric that is a representation of that failure.
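As an illustration of pairing a failure with a metric, a simple code-based check for the "bad framing" failure might look like the sketch below; the trigger phrases are hypothetical and should be replaced with the patterns actually observed during error analysis.

```python
import re

# Illustrative phrases that signal the "bad framing" failure, where the model
# talks as if the user supplied the retrieved context. Adjust to the phrasing
# observed during error analysis.
BAD_FRAMING_PATTERNS = [
    r"\bthe context you provided\b",
    r"\bbased on your (context|documents)\b",
    r"\bin the provided context\b",
]


def bad_framing_rate(responses):
    """Fraction of responses that trigger the bad-framing check (lower is better)."""
    flagged = 0
    for text in responses:
        if any(re.search(p, text, flags=re.IGNORECASE) for p in BAD_FRAMING_PATTERNS):
            flagged += 1
    return flagged / len(responses) if responses else 0.0


# Example: compare the baseline against a variant with an updated prompt,
# keeping the change only if the rate drops without hurting recall on the dev set.
# baseline_rate = bad_framing_rate(baseline_responses)
# variant_rate = bad_framing_rate(new_prompt_responses)
```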