
Two NEW n8n RAG Strategies (Anthropic’s Contextual Retrieval & Late Chunking)
The AI Automators
17,032 views • 5 months ago
Video Summary
The "lost context" problem in RAG agents, where agents fail to answer questions accurately or hallucinate responses, stems from how documents are chunked and embedded. Standard RAG systems split documents into independent chunks, losing the contextual links between them. This results in crucial information not being retrieved.
Two new techniques aim to solve this: "late chunking" and "contextual retrieval with context caching." Late chunking embeds the entire document first using long-context embedding models, preserving context, before chunking. Contextual retrieval uses LLMs to generate descriptive blurbs for each chunk within the document's context, which are then embedded, improving accuracy and reducing hallucinations.
Implementing these techniques, particularly late chunking, can be complex and requires custom workflows in tools like n8n. While these methods offer significant improvements in retrieval accuracy, they also introduce challenges related to ingestion time, cost, and rate limits, especially with large documents.
Short Highlights
- The "lost context" problem in RAG agents occurs when document chunks are processed independently, losing the connection between them, leading to inaccurate retrieval and hallucinations.
- Late chunking addresses this by embedding entire documents first using long-context embedding models before segmenting them, preserving inter-chunk context.
- Contextual retrieval with context caching enhances chunks by using LLMs to generate descriptive blurbs based on the entire document context, which are then embedded.
- Implementing these advanced techniques often requires custom workflows, particularly in tools like n8n, due to limitations in native support for custom embedding models and parameters.
- Both late chunking and contextual retrieval can lead to increased ingestion time, higher costs, and potential rate limit issues, especially when dealing with very large documents.
Key Details
The Lost Context Problem in RAG Agents [0:00]
- Agents can fail to answer questions accurately or hallucinate answers due to the "lost context" problem.
- This occurs when documents are split into segments (chunks) for processing, and these chunks are treated independently.
- In a standard RAG system, chunks are sent to an embedding model independently, causing them to lose context from the original document and other chunks.
- If a query uses a word that is not present in a specific chunk, that chunk might receive a low retrieval score even if it is contextually relevant (a small illustration follows this section).
- This leads to incomplete or inaccurate answers and can result in hallucinations when unrelated chunks are retrieved and used by the LLM.
The best way to explain this problem is through an example.
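To make the lost-context problem concrete, here is a small illustration (not from the video): a two-sentence document is split into independent chunks, and the chunk that actually answers a "population of Berlin" query no longer mentions Berlin at all.

```javascript
// Illustrative only: a two-sentence "document" split into independent chunks.
// The second chunk no longer mentions "Berlin", so a query like
// "What is the population of Berlin?" tends to score it poorly even
// though it holds the answer.
const document =
  "Berlin is the capital of Germany. " +
  "Its population is roughly 3.85 million people.";

// Naive sentence-level chunking, as a standard RAG pipeline might do.
const chunks = document.split(". ").map(s => s.trim()).filter(Boolean);

console.log(chunks);
// [ "Berlin is the capital of Germany",
//   "Its population is roughly 3.85 million people." ]
// Each chunk is embedded on its own, so the link between "Its" and
// "Berlin" is lost at embedding time.
```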
Late Chunking [3:17]
- Late chunking is a new approach to chunking that involves contextual chunk embeddings using long-context embedding models.
- Instead of chunking first and then embedding (standard RAG), late chunking embeds first and then chunks.
- This is made possible by embedding models with increasingly long context windows, allowing entire documents or large portions of them to be embedded simultaneously.
- In this approach, every token within the document gets a vector embedding while maintaining the document's overall context.
- After embedding, the document is chunked using a chosen strategy (sentences, paragraphs, fixed length, etc.).
- Instead of sending these chunks for embedding, the previously created embeddings associated with the text of each chunk are identified.
- These embeddings are then pooled, i.e. averaged, to represent the chunk, and the aggregated vector is stored; a minimal pooling sketch follows this section.
- This method preserves the links between sentences and paragraphs because all embeddings were created with the full document context.
- Long-context embedding models like Mistral and Qwen support up to 32,000 tokens, while Jina AI's embeddings v3 supports 8,192 tokens, allowing much larger spans of text to be embedded in one pass.
With the late chunking approach, we're actually embedding first and then we're chunking.
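As a rough sketch of the pooling step above, assuming you already have token-level embeddings for the whole document and know each chunk's token span (something most hosted embedding APIs do not expose directly, which is why the video relies on Jina's late-chunking flag instead), mean pooling could look like this:

```javascript
// Sketch only: mean-pool token embeddings into one vector per chunk.
// Assumes `tokenEmbeddings` is an array of same-length vectors, one per
// document token, produced by a long-context embedding model, and that
// `chunkSpans` lists each chunk's [startToken, endToken) range.
function meanPoolChunk(tokenEmbeddings, start, end) {
  const dim = tokenEmbeddings[0].length;
  const pooled = new Array(dim).fill(0);
  for (let t = start; t < end; t++) {
    for (let d = 0; d < dim; d++) pooled[d] += tokenEmbeddings[t][d];
  }
  return pooled.map(v => v / (end - start));
}

function lateChunk(tokenEmbeddings, chunkSpans) {
  // One pooled vector per chunk; every token was embedded with full
  // document context, so the pooled vectors retain that context.
  return chunkSpans.map(([start, end]) =>
    meanPoolChunk(tokenEmbeddings, start, end)
  );
}
```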
Implementing Late Chunking in n8n [6:49]
- Implementing late chunking in n8n requires manual work because the platform doesn't support custom embedding models directly or allow passing custom parameters, such as a late-chunking flag, to embedding models.
- The workflow involves fetching a file (e.g., from Google Drive), extracting text, and checking its size.
- If the text is too large for the embedding model's context window, a summary is created, and the document is split into large segments (e.g., 28,000 characters).
- These large segments are then split further into more granular chunks (e.g., 1,000 characters with a 200-character overlap).
- This granular chunking is done in custom JavaScript Code nodes because n8n's native text splitters are sub-nodes of vector stores rather than standalone nodes; a minimal splitter sketch follows this section.
- The list of granular chunks is sent to the embedding model in a single batch.
- The embedding model is called with a flag that enables late chunking and the task set to "retrieval.passage" for indexing; an example request is sketched after this section.
- The resulting vectors are then upserted into a vector store such as Qdrant.
- Testing shows that this approach can retrieve more comprehensive information compared to a simple RAG setup.
It was pretty fast and you can see the vector store here with 645 points for that 170 page document.
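Because n8n's text splitters are sub-nodes of the vector store nodes, the granular split is done in a Code node. A minimal sketch of a fixed-length splitter with overlap, assuming the previous node outputs items with a `text` field (the field name is an assumption, not from the video):

```javascript
// n8n Code node sketch: split each incoming large segment into
// ~1,000-character chunks with a 200-character overlap.
const CHUNK_SIZE = 1000;
const OVERLAP = 200;

const out = [];
for (const item of $input.all()) {
  const text = item.json.text; // assumed field name from the previous node
  for (let start = 0; start < text.length; start += CHUNK_SIZE - OVERLAP) {
    out.push({ json: { chunk: text.slice(start, start + CHUNK_SIZE) } });
    if (start + CHUNK_SIZE >= text.length) break;
  }
}
return out;
```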
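The batched embedding call and the upsert can then be made from HTTP Request nodes or a Code node. Below is a plain Node.js sketch (Node 18+, global fetch) assuming Jina's embeddings v3 endpoint with the late-chunking flag and a local Qdrant instance with a "docs" collection; endpoint paths, field names, and the collection name are assumptions to verify against the current Jina and Qdrant docs:

```javascript
// Sketch: embed all granular chunks in one batched call with late
// chunking enabled, then upsert the vectors into Qdrant.
async function embedAndUpsert(chunks, jinaApiKey) {
  const embedResponse = await fetch("https://api.jina.ai/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${jinaApiKey}`,
    },
    body: JSON.stringify({
      model: "jina-embeddings-v3",
      task: "retrieval.passage", // indexing side of retrieval
      late_chunking: true,       // embed chunks with full-document context
      input: chunks,
    }),
  });
  const { data } = await embedResponse.json();

  // Qdrant upsert: one point per chunk, original text kept in the payload.
  const points = data.map((d, i) => ({
    id: i,
    vector: d.embedding,
    payload: { text: chunks[i] },
  }));

  await fetch("http://localhost:6333/collections/docs/points?wait=true", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ points }),
  });
}
```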
Contextual Retrieval with Context Caching [13:43]
- This technique leverages the long context window of Large Language Models (LLMs), not embedding models, to provide context to each chunk.
- The process involves splitting a document into chunks and then sending each chunk along with the original document to an LLM.
- The LLM analyzes the chunk within the document's context and generates a short, descriptive blurb explaining how the chunk fits into the document.
- This descriptive blurb is then combined with the original chunk.
- This combined text (chunk + blurb) is then sent to an embedding model to produce the vectors, which are stored in a vector database.
- This approach helps retain context, so even if a chunk doesn't contain a specific keyword, its associated blurb provides context for retrieval.
- Key challenges with this method are ingestion time and cost, since without caching the entire document is sent to the LLM once per chunk.
- Prompt (context) caching significantly reduces this cost by avoiding re-sending the full document with every request.
- The technique can be implemented using LLMs with context caching capabilities, like Gemini 1.5 Flash, which has a long context window and is relatively inexpensive.
- For context caching, files typically need to exceed a minimum token threshold (32,768 tokens for Gemini 1.5 Flash).
- The process involves encoding the file, sending it for context caching to obtain a cache ID, and then referencing this ID when generating the descriptive blurb for each chunk (see the caching sketch after this section).
- A batching strategy is crucial to avoid hitting LLM rate limits: chunks are processed in smaller groups with pauses in between (a batching sketch also follows).
- Quantitative evaluations suggest a significant reduction in chunk retrieval failure rates when using this method, especially when combined with re-ranking.
So that gives you back a one-sentence description and from there then you add that descriptive blurb with the chunk and you send that into the embedding model.
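A sketch of the caching flow with the Gemini REST API, as I understand it (the model name, endpoints, and field names are assumptions to check against the current Gemini docs): the full document is cached once, and the returned cache name is referenced when generating a one-sentence blurb for each chunk.

```javascript
// Sketch in plain Node.js (18+, global fetch). Treat endpoints and
// request shapes as assumptions, not a definitive implementation.
const BASE = "https://generativelanguage.googleapis.com/v1beta";
const MODEL = "models/gemini-1.5-flash-001"; // a cache-capable model version

// 1. Upload the full document once as cached content and keep its ID.
async function createCache(documentText, apiKey) {
  const res = await fetch(`${BASE}/cachedContents?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      contents: [{
        role: "user",
        parts: [{
          inlineData: {
            mimeType: "text/plain",
            data: Buffer.from(documentText).toString("base64"),
          },
        }],
      }],
      ttl: "3600s",
    }),
  });
  const cache = await res.json();
  return cache.name; // e.g. "cachedContents/abc123"
}

// 2. For each chunk, ask the model for a one-sentence contextual blurb,
//    referencing the cached document instead of re-sending it.
async function describeChunk(chunk, cacheName, apiKey) {
  const res = await fetch(`${BASE}/${MODEL}:generateContent?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      cachedContent: cacheName,
      contents: [{
        role: "user",
        parts: [{
          text:
            "In one sentence, describe how the following chunk fits into " +
            `the cached document:\n\n${chunk}`,
        }],
      }],
    }),
  });
  const data = await res.json();
  return data.candidates[0].content.parts[0].text;
}
```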
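To stay under rate limits, the chunks can then be processed in small batches with a pause between them, and each blurb is prepended to its chunk before embedding. A rough sketch reusing the hypothetical describeChunk helper from the previous example (batch size and delay are arbitrary):

```javascript
// Sketch: process chunks in small batches with pauses, then prepend each
// blurb to its chunk so the combined text is what gets embedded.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function contextualizeChunks(chunks, cacheName, apiKey) {
  const BATCH_SIZE = 10;   // arbitrary; tune to your rate limits
  const PAUSE_MS = 30_000; // pause between batches

  const enriched = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const blurbs = await Promise.all(
      batch.map(chunk => describeChunk(chunk, cacheName, apiKey))
    );
    batch.forEach((chunk, j) =>
      enriched.push(`${blurbs[j]}\n\n${chunk}`) // blurb + original chunk
    );
    if (i + BATCH_SIZE < chunks.length) await sleep(PAUSE_MS);
  }
  return enriched; // these combined texts are what get embedded
}
```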