Why Your RAG Pipeline Feels Slow: Avoid Sending Entire Pages to the LLM

Date: 2026-05-17

Diagnose and fix latency in your Retrieval-Augmented Generation pipeline by preventing oversized inputs to your LLM. Learn why chunk deduplication and token logging are essential.

Tags: ["RAG", "Azure OpenAI", "Performance", "Vector Search", "LLM Optimization"]

Retrieval-Augmented Generation (RAG) pipelines empower developers to build AI systems that deliver highly relevant answers by combining embeddings, vector search, and large language models (LLMs). Yet in practice, latency often kills user experience, especially when the final synthesis with an LLM takes tens of seconds.

In a recent diagnostic deep dive into a .NET-based RAG implementation using Elasticsearch KNN search and GPT-4.1-mini, the root cause of the delays was not the model itself but oversized inputs. Whole pages, split into multiple chunks, were being sent for synthesis, flooding the LLM with repeated content.

This post breaks down why chunk-level deduplication by source document is critical, how proper logging exposed the bottleneck, and which concrete fixes reduced token counts and latency. If your RAG pipeline runs slowly, these insights will help you audit, observe, and optimize it before reaching for a costlier model.

Architecture Overview

The pipeline embeds the incoming user query, uses that embedding to fetch the most similar chunks from Elasticsearch, applies a reranker to order those chunks by relevance, and then selects which chunks are passed to GPT-4.1-mini for final synthesis.

Figure: Diagnostic logs showing chunk source URIs and token counts, revealing the oversized chunk payload behind the latency. Source: Jamie Maguire.

Key Technical Observations

Chunk-Level KNN Returns Multiple Paragraphs Per Document — When the corpus is split into paragraph-level chunks, Elasticsearch's KNN search often returns several highly relevant chunks from the exact same document. Without filtering, the same document floods the synthesis input with redundant information.
Lack of Deduplication Results in Oversized Payloads — The initial pipeline didn’t group or deduplicate chunks by their DocumentUri, leading to nearly 20,000 tokens sent to the LLM per query, exceeding practical limits for interactive applications.
Token Count Logging is Vital for Latency Diagnosis — Adding diagnostic logs to capture token counts and chunk counts before each synthesis call immediately exposes the expensive payload size contributing to slowdowns.
Multiple LLM Calls Per Single User Query Can Multiply Latency — Three synthesis calls were happening per request instead of one. One pathway bypassed chunk cleaning entirely, causing high token counts and ballooning response times.
Semantic Cache Invisible Without Proper Logging Level — A semantic similarity cache existed but was effectively dead because hits were only logged at Debug level, invisible in production logs, while scoring thresholds mismatched between code paths.
Consistent KNN Querying is Essential for Cache Effectiveness — Different KNN query construction between cache lookups and main vector searches led to incomparable similarity scores, making the cache never hit and adding unnecessary compute.

How It Works

The core workflow exposes several phases that collectively determine both the correctness and the performance of the RAG pipeline.

1. Query Embedding and KNN Search

When a query arrives, it is embedded into vector space, after which Elasticsearch's KNN method retrieves the top N chunks by similarity score. These chunks typically represent paragraph-sized pieces of documentation or knowledge base articles.

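// Initial chunk cleaning method before the fix.
// Note: generic type arguments were missing from the listing; Chunk is a placeholder name.
private IEnumerable<Chunk> CleanChunks(IEnumerable<Chunk> chunks)
{
    return chunks
        .Where(c => c.Score >= MinRelevanceScore)   // drop low-relevance matches
        .OrderByDescending(c => c.Score)            // highest score first
        .Take(MaxChunksForSynthesis);               // cap the total sent to the LLM
}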

While simple and clean, this approach allowed multiple chunks from one document to pass through, multiplying tokens unnecessarily.

2. Reranking and Chunk Deduplication

The fix adds grouping by DocumentUri and limits the number of chunks selected per document to prevent flooding the LLM with repeated content.

// Improved chunk cleaning with group-by-document deduplication.
// Note: generic type arguments were missing from the listing; Chunk is a placeholder name.
private IEnumerable<Chunk> CleanChunks(IEnumerable<Chunk> chunks)
{
    return chunks
        .Where(c => c.Score >= MinRelevanceScore)
        .OrderByDescending(c => c.Score)
        .GroupBy(c => c.DocumentUri)    // groups keep the score order established above
        .SelectMany(g => g.Take(1))     // or Take(2) for multi-section documents
        .Take(MaxChunksForSynthesis);
}

This method passes at most one chunk per source document downstream (or two, with Take(2)), drastically reducing token usage while preserving diversity of sources.
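
To see the per-document cap in action, here is a minimal, self-contained sketch. The `Chunk` record, the sample URIs and scores, and the `ChunkSelector.Clean` parameters are illustrative assumptions, not types from the original pipeline. The sketch also re-sorts by score after grouping. That extra step is an addition for illustration and only matters when more than one chunk per document is allowed, since grouping otherwise places a document's second chunk ahead of a better-scoring chunk from another document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: three chunks from page a, one each from pages b and c.
var chunks = new List<Chunk>
{
    new("https://example.com/docs/a", 0.91, "A, paragraph 1"),
    new("https://example.com/docs/a", 0.89, "A, paragraph 2"),
    new("https://example.com/docs/b", 0.90, "B, paragraph 1"),
    new("https://example.com/docs/a", 0.85, "A, paragraph 3"),
    new("https://example.com/docs/c", 0.40, "C, paragraph 1"),
};

foreach (var c in ChunkSelector.Clean(chunks, minScore: 0.5, maxPerDocument: 1, maxChunks: 5))
{
    Console.WriteLine($"{c.Score:F2} {c.DocumentUri}");
}

// Illustrative chunk shape; the real pipeline's type is not shown in the article.
public record Chunk(string DocumentUri, double Score, string Text);

public static class ChunkSelector
{
    public static List<Chunk> Clean(
        IEnumerable<Chunk> chunks, double minScore, int maxPerDocument, int maxChunks)
    {
        return chunks
            .Where(c => c.Score >= minScore)          // drop weak matches
            .OrderByDescending(c => c.Score)          // best first
            .GroupBy(c => c.DocumentUri)              // one group per source document
            .SelectMany(g => g.Take(maxPerDocument))  // cap chunks per document
            .OrderByDescending(c => c.Score)          // restore global score order
            .Take(maxChunks)
            .ToList();
    }
}
```

With this sample data and a cap of one, only the top chunk from page a and the chunk from page b survive: page c falls below the score threshold, and the two lower-scoring paragraphs from page a are dropped.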

3. LLM Synthesis Call

The selected chunks are concatenated and sent as a prompt to the GPT-4.1-mini model for answer generation.
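
A rough sketch of logging payload size just before the synthesis call follows. The four-characters-per-token estimate is a common rule of thumb for English text, not the model's real tokenizer, and `PromptDiagnostics`, `Describe`, and the log format are hypothetical names introduced here. A production pipeline would more likely use the model's tokenizer and its existing logger.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical selected chunks, using filler text of known length.
var selected = new List<(string DocumentUri, string Text)>
{
    ("https://example.com/docs/a", new string('x', 4000)),
    ("https://example.com/docs/b", new string('y', 2000)),
};

// Log before the LLM call so an oversized payload shows up next to the latency it causes.
Console.WriteLine(PromptDiagnostics.Describe("synthesis", selected));
// prints: [synthesis] chunks=2 documents=2 estimatedTokens=1500

public static class PromptDiagnostics
{
    // Assumption: ~4 characters per token. Swap in a real tokenizer for exact counts.
    public static int EstimateTokens(string text) => (text.Length + 3) / 4;

    public static string Describe(
        string callName, IReadOnlyCollection<(string DocumentUri, string Text)> chunks)
    {
        int totalTokens = chunks.Sum(c => EstimateTokens(c.Text));
        int distinctDocs = chunks.Select(c => c.DocumentUri).Distinct().Count();
        return $"[{callName}] chunks={chunks.Count} documents={distinctDocs} estimatedTokens={totalTokens}";
    }
}
```

Logging the distinct document count next to the chunk count makes duplication visible at a glance: eight chunks drawn from two documents is a strong hint that the same page is being sent several times.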

The fix shrank the synthesis payload, cutting the token count by roughly 38%:

Chunks retrieved: 3 (down from 8)
Token count sent to synthesis LLM: ~12,266 (down from ~19,700)
Latency: reduced significantly

4. Semantic Cache and Multi-Call Complications

Further instrumentation revealed:

Multiple LLM calls per user query inflate cost and response times.
Certain code paths skipped chunk cleaning, sending huge payloads (62,860 tokens).
Cache hits were hidden due to low logging levels.
Inconsistent KNN query construction thwarted cache effectiveness.

Addressing these underlying issues is as important as the deduplication fix.
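
One way to surface extra synthesis calls is a small per-request counter that every code path reports to before it calls the model. The sketch below assumes a hypothetical `RequestLlmMetrics` class and made-up call names. The token figures are reused from this post purely as sample values, and nothing here is taken from the original codebase.

```csharp
using System;
using System.Collections.Generic;

// One instance per user request; the expected call count is an assumption you set.
var metrics = new RequestLlmMetrics(expectedCalls: 1);

// Each code path that reaches the model records itself; the names are hypothetical.
metrics.RecordCall("main-synthesis", estimatedTokens: 12266);
metrics.RecordCall("fallback-synthesis", estimatedTokens: 62860);

Console.WriteLine(metrics.Summary());

public sealed class RequestLlmMetrics
{
    private readonly int _expectedCalls;
    private readonly List<string> _calls = new();
    private readonly object _gate = new();
    private long _totalTokens;

    public RequestLlmMetrics(int expectedCalls) => _expectedCalls = expectedCalls;

    public void RecordCall(string callName, int estimatedTokens)
    {
        lock (_gate)
        {
            _calls.Add($"{callName}:{estimatedTokens}");
            _totalTokens += estimatedTokens;
        }
    }

    public int CallCount { get { lock (_gate) return _calls.Count; } }

    public bool ExceedsExpected => CallCount > _expectedCalls;

    // Marks the summary as a warning when more calls happened than intended.
    public string Summary()
    {
        lock (_gate)
        {
            string level = _calls.Count > _expectedCalls ? "WARN" : "INFO";
            return $"{level} llmCalls={_calls.Count} expected={_expectedCalls} " +
                   $"totalTokens={_totalTokens} [{string.Join(", ", _calls)}]";
        }
    }
}
```

Emitting this summary once per request, at a level that production logs actually capture, turns "three calls instead of one" into something a dashboard can alert on.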

Quick Tips & Tricks

Always Log Token Counts Before Synthesis — Your fastest path to diagnosing latency is understanding how large your model inputs really are.
Deduplicate Chunks By Source Document — Avoid sending multiple paragraphs from the same document to your LLM by grouping and capping chunks per DocumentUri.
Count and Log Every LLM Call Separately — One user query can mask multiple synthesis calls. Instrument to ensure you're hitting your intended call count.
Raise Cache Hit Logs to Information Level — Make semantic cache hits visible in production logs to enable monitoring and tuning.
Align Vector Search and Cache KNN Implementations — Use consistent KNN query methods and score normalization to avoid silently disabling cache hits.
Set Reasonable Chunk Limits (e.g. Take 1 or 2 Per Group) — For multi-section documents, more than one chunk can preserve context without overwhelming token budgets.
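
To illustrate the tip about aligning KNN implementations, the sketch below builds the request body for both the cache lookup and the main search from a single helper, so both paths score vectors the same way. It targets the top-level `knn` option of the Elasticsearch search API (`field`, `query_vector`, `k`, `num_candidates`). The `KnnRequest` helper, the `embedding` field name, and the specific `k` values are assumptions for illustration, not details from the original pipeline.

```csharp
using System;
using System.Text.Json;

float[] queryVector = { 0.12f, -0.03f, 0.88f };

// Both paths call the same builder, so their similarity scores stay comparable.
string cacheLookupBody = KnnRequest.Build("embedding", queryVector, k: 1, numCandidates: 50);
string mainSearchBody = KnnRequest.Build("embedding", queryVector, k: 8, numCandidates: 100);

Console.WriteLine(cacheLookupBody);
Console.WriteLine(mainSearchBody);

public static class KnnRequest
{
    public static string Build(string field, float[] queryVector, int k, int numCandidates)
    {
        // Elasticsearch rejects requests where num_candidates is smaller than k.
        if (numCandidates < k)
            throw new ArgumentException("numCandidates must be >= k", nameof(numCandidates));

        var body = new
        {
            knn = new
            {
                field,
                query_vector = queryVector,
                k,
                num_candidates = numCandidates
            }
        };
        return JsonSerializer.Serialize(body);
    }
}
```

Keeping the query shape in one place means a cache threshold tuned against one path remains meaningful for the other.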

Conclusion

Slow RAG pipelines are frequently victims of indiscriminate chunk selection that sends multiple overlapping paragraphs from the same source document into the LLM prompt. This silently inflates token counts and latency, and it should be ruled out before any model swap or infrastructure change is considered.

Carefully auditing token counts and chunk sources with diagnostic logging is the most reliable first step to uncovering hidden excess. Incorporating grouping by DocumentUri to limit chunks from each document reduces token use significantly while keeping relevant context.

Beyond deduplication, developers must explicitly track synthesis calls and elevate logging around semantic caching to ensure their pipelines remain performant and cost-effective at scale.

As vector search and LLM integration deepen in applications, these hygiene practices will remain essential for delivering responsive, scalable AI-powered experiences.

References

Your RAG Pipeline Is Slow Because You’re Sending Entire Pages to the LLM – Jamie Maguire