Series: RAG Engineering in Production · Part 8

Guide for: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Why Your RAG Pipeline Assembles Context Wrong

Retrieval finds the right document. Context assembly decides whether the LLM reads it. Most teams treat this as a pass-through - and pay for it in accuracy, cost, and latency.

#rag #context-engineering #context-assembly #lost-in-the-middle #token-budget #deduplication #context-compression #production-ai #llm-infrastructure

The retriever returned the right document. It was in the top-5. It contained the answer. The LLM did not use it.

It was at position 4 in a 5-document context, surrounded by tangentially relevant material on both sides. The LLM allocated most of its attention to the documents at positions 1 and 5. Position 4 received less attention - not because the content was wrong, not because the retrieval failed, but because of where the content was placed in the context window.

This is the most expensive failure mode nobody talks about in RAG engineering. You can get every upstream layer right - retrieval strategy (Part 1), chunking (Part 2), embedding (Part 3), reranking (Part 4), evaluation (Part 5), agentic governance (Part 6), index freshness (Part 7) - and still produce wrong answers because the step between retrieval and the LLM prompt was treated as a pass-through.

That step is context assembly: the deliberate architectural decisions about what the model sees, how much of it, in what order, with what structure. Andrej Karpathy named the broader practice in June 2025: "context engineering is the delicate art and science of filling the context window with just the right information for the next step." The LangChain State of Agent Engineering survey of 1,340 respondents found context engineering and managing context at scale were the top-named production challenges. Most documentation still treats it as concatenate-top-k-and-send.

The thesis of this article is specific: most RAG pipelines assemble context in the worst possible way for LLM attention - they concatenate retrieved chunks in retrieval order, add no structure, apply no budget policy, skip deduplication, and never compress. The result is a context window that contains the right information but is organized to make that information hard for the model to use.


Named Concept: The Context Assembly Gap

The Context Assembly Gap is the quality delta between what retrieval found and what the LLM actually processes. A retriever that surfaces the right document, paired with a context assembler that places it at position 10 of 12 in a flat concatenation, can produce roughly the same output as a retriever that never found it - the LLM's positional bias pushes that document's effective contribution toward zero.

The gap has four components that compound:

Positional degradation. LLMs exhibit a U-shaped performance curve across the context window: information at the beginning and end is used reliably, information in the middle significantly less so. The answer document at position 10 of 20 is used less reliably than the same document at position 1 or 20. Concatenating retrieved chunks in retrieval order puts the top-scored chunk first but leaves every other strong chunk wherever its rank falls - which, for the second- and third-best chunks, is the middle.

Duplication noise. RAG systems commonly retrieve overlapping chunks: the same paragraph from two source documents, adjacent chunks from the same document that were retrieved separately, or contextual retrieval summaries that partially repeat the underlying content. Without deduplication, the model reads the same fact multiple times, consuming extra tokens and potentially over-weighting duplicated information in its generation.

Budget misallocation. The context window is a zero-sum budget shared between system prompt, retrieved documents, conversation history, and the current query. Without explicit allocation policy, the retrieved content expands to fill whatever space is available, crowding out conversation history or forcing the system prompt into a smaller share. At 100:1 input-to-output token ratios, every token in context costs money. Unmanaged budget allocation is unmanaged cost.

Compression failure. When retrieved context exceeds the available token budget, naive pipelines either truncate (drop everything past the limit) or retrieve fewer documents (sacrifice recall). Both are suboptimal. Selective compression - summarizing low-relevance chunks while preserving high-relevance ones verbatim - reduces token consumption while preserving the information the model actually needs.
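Three of these components can be made visible with a few lines of instrumentation before any fix is attempted. The following is a minimal sketch, not part of the original pipeline: the `Chunk` type, the `diagnose_context` name, and the whitespace-based token estimate are all assumptions for illustration (a real system would use the model's tokenizer).

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def diagnose_context(chunks: list[Chunk], docs_budget: int) -> dict:
    """Report simple signals for position, duplication, and budget."""
    if not chunks:
        return {"best_position": None, "duplicate_tokens": 0,
                "total_tokens": 0, "over_budget_by": 0}

    # Positional degradation: where did the top-scored chunk land (1-based)?
    best_index = max(range(len(chunks)), key=lambda i: chunks[i].score)

    # Duplication noise: tokens spent on verbatim repeats of earlier chunks.
    seen: set[str] = set()
    duplicate_tokens = 0
    for chunk in chunks:
        key = chunk.text.strip()
        if key in seen:
            duplicate_tokens += approx_tokens(chunk.text)
        seen.add(key)

    # Budget misallocation: how far the chunks overshoot the docs budget.
    total_tokens = sum(approx_tokens(c.text) for c in chunks)
    return {
        "best_position": best_index + 1,
        "duplicate_tokens": duplicate_tokens,
        "total_tokens": total_tokens,
        "over_budget_by": max(0, total_tokens - docs_budget),
    }
```

Logging this report for a sample of production queries gives a rough picture of which component dominates before investing in any one stage.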


The Wrong Way: Concatenate and Send

code
# Wrong way: the most common context assembly pattern in production RAG
def assemble_context_naive(query: str, vectorstore, llm) -> str:
    """
    What most RAG tutorials show, and what most production systems ship with.

    Problems:
    1. Ordering: chunks returned in retrieval score order, not LLM-attention order.
       The top-scored chunk goes first. But if there are 5 chunks, positions 2-4
       sit in the middle of the context window - the attention dead zone.

    2. No deduplication: if two source documents contain the same paragraph,
       both are retrieved and both are passed to the LLM. The model reads
       the same sentence twice, paying token cost twice, and potentially
       amplifying that content relative to non-duplicated information.

    3. No budget policy: the retrieved context is concatenated until complete.
       If the system prompt is long, or conversation history is long,
       or the 5 retrieved chunks are each 500 tokens, the total context
       may exceed the budget the system was designed for. No enforcement.

    4. No compression: if the budget is exceeded, truncation drops everything
       past the limit. The chunk at position 5 may be partially or fully cut.
       The model has no idea the context was truncated.
    """
    docs = vectorstore.similarity_search(query, k=5)

    # Concatenated in retrieval order, no processing
    context = "\n\n".join([doc.page_content for doc in docs])

    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.invoke(prompt).content

The Right Way: A Four-Stage Context Assembly Pipeline

Context assembly is not a single operation. It is a pipeline with four distinct stages, each addressing a different component of the Context Assembly Gap.

Stage 1: Ordering - Put the Most Relevant Document Where the Model Will Read It

Liu et al. (TACL 2024, arXiv:2307.03172) measured a U-shaped performance curve across the models they tested: accuracy is highest when the answer document sits at the start or end of the context and lowest when it sits in the middle (for example, position 10 of 20). On multi-document question answering, they report accuracy drops of more than 20% in the worst case. This is not a bug in one model. It is commonly attributed to how positional encodings such as Rotary Positional Embedding (RoPE) interact with transformer attention dynamics, and it appeared across the models and context lengths in the study.

The fix is to bookend: place the highest-scored document first, the second-highest last, and fill the middle with supporting context. If you are retrieving five documents, positions 1 and 5 are where the model attends most strongly.

code
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float       # Cross-encoder reranker score from Part 4
    doc_id: str
    source: str


def bookend_ordering(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Place highest-scored chunks at first and last positions.
    Fill middle positions with remaining chunks in descending score order.

    This directly counteracts the U-shaped attention bias documented by
    Liu et al. (TACL 2024): the model attends most strongly to beginning
    and end. Putting your best evidence there maximizes the probability
    the model uses it.

    For k=5 chunks:
    Input order (reranker score desc): [A, B, C, D, E]
    Bookend order:                     [A, C, D, E, B]
    A (best) at position 1, B (second best) at position 5.
    C, D, E fill the middle.
    """
    if not chunks:
        return []
    if len(chunks) == 1:
        return chunks

    sorted_chunks = sorted(chunks, key=lambda x: x.score, reverse=True)

    best = sorted_chunks[0]
    second_best = sorted_chunks[1]
    middle = sorted_chunks[2:]

    # Bookend: best first, second-best last, rest in middle
    return [best] + middle + [second_best]


def ordering_for_size(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Alternative: descending relevance for small sets (k <= 3),
    bookend for larger sets (k > 3) where mid-context degradation
    becomes significant.
    """
    if len(chunks) <= 3:
        return sorted(chunks, key=lambda x: x.score, reverse=True)
    return bookend_ordering(chunks)
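The bookend function above keeps the middle in descending score order, so the weakest chunk ends up next to the second-best one. A possible variant, which is not the article's method, alternates ranks between the front and the back so that relevance decreases toward the center from both ends. The sketch below uses a trimmed-down `ScoredChunk` and a hypothetical function name, `alternating_ends_ordering`; whether it beats plain bookending is something to measure on your own evaluation set.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float


def alternating_ends_ordering(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """Order chunks so relevance falls off toward the middle from both ends.

    For scores ranked [A, B, C, D, E] the result is [A, C, E, D, B]:
    best first, second-best last, weakest in the center.
    """
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front = ranked[0::2]   # ranks 1, 3, 5, ... fill from the start
    back = ranked[1::2]    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]
```

As with bookending, the best and second-best chunks still occupy the first and last positions; only the arrangement of the middle changes.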

Stage 2: Deduplication - Remove Redundant Information Before It Reaches the LLM

Duplication in retrieved context wastes tokens and distorts generation by over-weighting facts that happen to appear in multiple sources. Two deduplication strategies cover the production cases.

code
import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


def exact_dedup(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Hash-based exact deduplication.
    Removes verbatim duplicates: same text from two different source documents.
    Preserves the higher-scored copy when duplicates are found.

    Cost: O(n) - fast, runs before semantic dedup.
    """
    seen_hashes: set[str] = set()
    deduped: list[ScoredChunk] = []

    # Sort by score desc so we keep the higher-scored copy on collision
    for chunk in sorted(chunks, key=lambda x: x.score, reverse=True):
        # MD5 is used only as a content fingerprint here, not for security
        content_hash = hashlib.md5(chunk.text.strip().encode()).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduped.append(chunk)

    return deduped


def semantic_dedup(
    chunks: list[ScoredChunk],
    embed_model: SentenceTransformer,
    similarity_threshold: float = 0.92,
) -> list[ScoredChunk]:
    """
    Embedding-based semantic deduplication.
    Removes near-duplicate chunks: same content paraphrased or slightly reformatted.

    Threshold calibration:
      0.95+: only near-verbatim duplicates removed
      0.90-0.94: removes close paraphrases (recommended range)
      0.85-0.89: aggressive; may remove genuinely distinct chunks on the same topic

    Cost: one embedding call per chunk. Run exact dedup first to reduce n.
    """
    if len(chunks) <= 1:
        return chunks

    texts = [c.text for c in chunks]
    # Normalized embeddings make the dot product equal to cosine similarity
    embeddings = embed_model.encode(texts, normalize_embeddings=True)

    kept_indices: list[int] = []
    dropped: set[int] = set()

    # Greedy: keep highest-scored; drop anything too similar to what we kept
    sorted_indices = sorted(
        range(len(chunks)),
        key=lambda i: chunks[i].score,
        reverse=True,
    )

    for idx in sorted_indices:
        if idx in dropped:
            continue

        kept_indices.append(idx)

        # Mark similar chunks as dropped
        for other_idx in sorted_indices:
            if other_idx == idx or other_idx in dropped:
                continue
            sim = float(np.dot(embeddings[idx], embeddings[other_idx]))
            if sim >= similarity_threshold:
                dropped.add(other_idx)

    # Return survivors in their original input order
    return [chunks[i] for i in sorted(kept_indices)]


def dedup_pipeline(
    chunks: list[ScoredChunk],
    embed_model: SentenceTransformer,
    semantic_threshold: float = 0.92,
) -> list[ScoredChunk]:
    """
    Run exact dedup first (fast), then semantic dedup on the remainder.
    """
    after_exact = exact_dedup(chunks)
    return semantic_dedup(after_exact, embed_model, semantic_threshold)

Stage 3: Budget Allocation - Make Context a Governed Resource

The context window is a zero-sum budget. Every token allocated to retrieved documents is a token not available for conversation history, tool outputs, or system instructions. Without an explicit allocation policy, dynamic content expands until something gets truncated - usually the content that arrives last in the assembly order.

code
import tiktoken
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """
    Explicit token budget for each context zone.
    Total must stay below the model's context window; the generation
    zone reserves room for the output.

    Recommended starting allocation for a 128K-context model:
      system_prompt:    4,000 tokens  (stable instructions)
      retrieved_docs:  32,000 tokens  (retrieved context)
      conversation:    8,000 tokens   (rolling history)
      query:           1,000 tokens   (current query)
      generation:      4,000 tokens   (output budget)
      buffer:          1,000 tokens   (safety margin)

    Adjust per use case:
      - High-context analytical queries: increase retrieved_docs, reduce conversation
      - Customer support chat: increase conversation, reduce retrieved_docs
      - Single-turn Q&A: set conversation to 0
    """
    system_prompt_tokens: int = 4_000
    retrieved_docs_tokens: int = 32_000
    conversation_tokens: int = 8_000
    query_tokens: int = 1_000
    generation_tokens: int = 4_000
    buffer_tokens: int = 1_000

    @property
    def total(self) -> int:
        return (
            self.system_prompt_tokens
            + self.retrieved_docs_tokens
            + self.conversation_tokens
            + self.query_tokens
            + self.generation_tokens
            + self.buffer_tokens
        )


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """
    Count tokens with tiktoken's encoding for the given OpenAI model.
    For non-OpenAI models this is only an approximation; use the
    provider's own token counter where exact budgets matter.
    """
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def fit_chunks_to_budget(
    chunks: list[ScoredChunk],
    token_budget: int,
    model: str = "gpt-4o",
) -> list[ScoredChunk]:
    """
    Select chunks that fit within the retrieved_docs token budget.

    Strategy: include chunks greedily by score until budget is exhausted.
    A chunk that does not fully fit is dropped, not truncated mid-sentence.
    Truncating mid-chunk is worse than excluding the chunk entirely -
    the model receives an incomplete semantic unit.

    Log when chunks are excluded: this is a signal that either the budget
    is too tight or the retrieval is returning too many large chunks.
    """
    fitted: list[ScoredChunk] = []
    tokens_used = 0
    excluded = []

    # Greedy by score, regardless of the order the caller passed in.
    # Apply bookend ordering after fitting, not before.
    for chunk in sorted(chunks, key=lambda x: x.score, reverse=True):
        chunk_tokens = count_tokens(chunk.text, model)

        if tokens_used + chunk_tokens <= token_budget:
            fitted.append(chunk)
            tokens_used += chunk_tokens
        else:
            excluded.append((chunk.doc_id, chunk_tokens))

    if excluded:
        print(
            f"Budget: excluded {len(excluded)} chunks "
            f"({sum(t for _, t in excluded)} tokens) from context. "
            f"Used {tokens_used}/{token_budget} tokens."
        )

    return fitted
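The default allocation above sums to 50,000 tokens, well under a 128K window, so the retrieved-docs figure is a policy choice rather than a hard limit. One way to make that choice explicit is to derive the ceiling from the window and the fixed zones, then cap it at a share of the window. This is a sketch under assumptions: the function name `retrieved_docs_budget` and the `max_share` parameter are hypothetical and do not appear in the original code.

```python
def retrieved_docs_budget(
    context_window: int,
    system_prompt: int,
    conversation: int,
    query: int,
    generation: int,
    buffer: int,
    max_share: float = 1.0,
) -> int:
    """Tokens available for retrieved documents after fixed zones are reserved.

    max_share (hypothetical) caps retrieved docs at a fraction of the window
    so that a large window does not silently become a large bill.
    """
    fixed = system_prompt + conversation + query + generation + buffer
    remaining = context_window - fixed
    if remaining <= 0:
        raise ValueError(
            f"Fixed zones need {fixed} tokens but the window is {context_window}."
        )
    cap = int(context_window * max_share)
    return min(remaining, cap)
```

With the zone sizes from the docstring above and `max_share=0.25`, the result is 32,000 tokens, which matches the recommended starting allocation; with no cap, the same zones would leave 110,000 tokens for retrieved content.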

Stage 4: Compression - Preserve Signal When Budget Is Tight

When the token budget is genuinely insufficient to include all relevant chunks verbatim, selective compression preserves the information the model needs without the token cost of full inclusion.

code
from anthropic import Anthropic

client = Anthropic()


def compress_chunk(chunk: ScoredChunk, target_tokens: int) -> ScoredChunk:
    """
    Compress a chunk to approximately target_tokens using a fast LLM.

    Use for chunks that are relevant but exceed the per-chunk budget.
    Preserve the factual content; strip narrative structure and examples.

    When to compress vs drop:
    - Compress: chunk score above relevance floor, but chunk too large
    - Drop: chunk score below relevance floor (established in Part 4)

    Note: compression introduces an LLM call per chunk.
    Use only when budget is genuinely tight, not as a default operation.
    """
    prompt = f"""Compress the following text to approximately {target_tokens} tokens.
Preserve all specific facts, numbers, dates, names, and technical details.
Remove examples, analogies, and repetitive explanations.
Output only the compressed text, nothing else.

Text to compress:
{chunk.text}"""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast model for compression
        max_tokens=target_tokens + 100,     # Slight headroom
        messages=[{"role": "user", "content": prompt}]
    )

    compressed_text = response.content[0].text
    return ScoredChunk(
        text=compressed_text,
        score=chunk.score,
        doc_id=chunk.doc_id,
        source=chunk.source,
    )


def selective_compress(
    chunks: list[ScoredChunk],
    token_budget: int,
    relevance_floor: float = 0.0,
    max_chunk_tokens: int = 2_000,
    model: str = "gpt-4o",
) -> list[ScoredChunk]:
    """
    Selectively compress chunks that exceed max_chunk_tokens
    while leaving high-relevance, appropriately-sized chunks untouched.

    Order of operations:
    1. Drop chunks below relevance_floor entirely
    2. Compress chunks above max_chunk_tokens to max_chunk_tokens
    3. Return compressed + untouched chunks within budget,
       filling the budget from the highest score down
    """
    # relevance_floor is in the reranker's score scale; calibrate it (Part 4)
    filtered = sorted(
        (c for c in chunks if c.score >= relevance_floor),
        key=lambda c: c.score,
        reverse=True,
    )

    result: list[ScoredChunk] = []
    tokens_used = 0

    for chunk in filtered:
        chunk_tokens = count_tokens(chunk.text, model)

        if chunk_tokens > max_chunk_tokens:
            # Compress: too large to include verbatim
            compressed = compress_chunk(chunk, target_tokens=max_chunk_tokens)
            compressed_tokens = count_tokens(compressed.text, model)

            if tokens_used + compressed_tokens <= token_budget:
                result.append(compressed)
                tokens_used += compressed_tokens
        else:
            if tokens_used + chunk_tokens <= token_budget:
                result.append(chunk)
                tokens_used += chunk_tokens

    return result
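Where an LLM call per oversized chunk is too slow or too costly, a cruder extractive pass can serve as a fallback. This is a different technique from the LLM compression above, shown only as a sketch: it keeps the sentences that share the most words with the query. The function name `extractive_compress`, the regex sentence splitter, and the word-count token estimate are all assumptions for illustration.

```python
import re


def extractive_compress(text: str, query: str, target_tokens: int) -> str:
    """Keep the sentences that share the most words with the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))

    def overlap(sentence: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", sentence.lower())))

    # Rank sentence indices by overlap; ties keep original order (stable sort).
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: overlap(sentences[i]),
        reverse=True,
    )

    kept: list[int] = []
    used = 0
    for i in ranked:
        cost = len(sentences[i].split())  # Assumption: words approximate tokens
        if used + cost <= target_tokens:
            kept.append(i)
            used += cost

    # Emit kept sentences in their original order so the text still reads well.
    return " ".join(sentences[i] for i in sorted(kept))
```

Lexical overlap misses paraphrases and synonyms, so this is best treated as a cheap floor to compare the LLM-based compressor against, not a replacement for it.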

The Complete Context Assembly Pipeline

All four stages compose into a single assembly function called after retrieval and reranking.

code
def assemble_context(
    query: str,
    reranked_chunks: list[ScoredChunk],
    budget: ContextBudget,
    embed_model: SentenceTransformer,
    citation_format: bool = True,
    model: str = "gpt-4o",
) -> tuple[str, list[str]]:
    """
    Four-stage context assembly pipeline.

    Returns (assembled_context_string, citation_list)

    Execution order (differs from the article's stage numbering:
    ordering runs last so it applies to the final set of chunks):
    Step 1: Deduplication - remove redundant information
    Step 2: Budget fitting - select chunks within token budget
    Step 3: Compression  - compress oversized chunks if needed
    Step 4: Ordering     - bookend for LLM attention optimization

    Call after reranking (Part 4), before prompt construction.
    """
    if not reranked_chunks:
        return "", []

    # Step 1: Dedup
    deduped = dedup_pipeline(reranked_chunks, embed_model)

    # Steps 2 + 3: Fit to budget with selective compression
    compressed = selective_compress(
        deduped,
        token_budget=budget.retrieved_docs_tokens,
        model=model,
    )

    # Step 4: Reorder for attention optimization
    ordered = ordering_for_size(compressed)

    # Format with citations if requested
    if citation_format:
        formatted_chunks = []
        citations = []

        for i, chunk in enumerate(ordered, 1):
            formatted_chunks.append(
                f"[{i}] Source: {chunk.source}\n{chunk.text}"
            )
            citations.append(chunk.source)

        context = "\n\n---\n\n".join(formatted_chunks)
        return context, citations
    else:
        context = "\n\n".join([c.text for c in ordered])
        return context, [c.source for c in ordered]


def rag_with_context_assembly(
    query: str,
    retriever,
    reranker,
    embed_model: SentenceTransformer,
    llm,
    budget: ContextBudget,
) -> dict:
    """
    Full RAG pipeline with context assembly as an explicit stage.

    Parts 1-4 (retrieval, chunking, embedding, reranking) produce the candidates.
    Context assembly (Part 8) determines what the LLM actually reads.
    """
    # Parts 1-4: retrieve and rerank
    raw_docs = retriever.invoke(query)

    pairs = [(query, doc.page_content) for doc in raw_docs]
    rerank_scores = reranker.predict(pairs)

    scored_chunks = [
        ScoredChunk(
            text=doc.page_content,
            score=float(score),
            doc_id=doc.metadata.get("doc_id", f"doc_{i}"),
            source=doc.metadata.get("source", "unknown"),
        )
        for i, (doc, score) in enumerate(zip(raw_docs, rerank_scores))
    ]

    # Part 8: context assembly
    context, citations = assemble_context(
        query=query,
        reranked_chunks=scored_chunks,
        budget=budget,
        embed_model=embed_model,
    )

    if not context:
        return {"answer": "No relevant content found.", "citations": []}

    prompt = f"""Use the following sources to answer the question.
Cite sources inline using [1], [2] notation.

{context}

Question: {query}
Answer:"""

    answer = llm.invoke(prompt).content

    return {
        "answer": answer,
        "citations": citations,
        "chunks_retrieved": len(raw_docs),
        # One citation per chunk that survived dedup, budget, and compression
        "chunks_after_dedup_and_budget": len(citations),
    }

The Context Assembly Decision Guide

mermaid
flowchart TD
    A[Reranked chunks ready] --> B{k greater than 3?\nMultiple chunks to assemble}
    B -- No --> C[Relevance-descending order\nno bookending needed]
    B -- Yes --> D[Bookend ordering\nbest first, second-best last\nrest fill the middle]
    C --> E{Semantic duplicates\npresent in candidates?}
    D --> E
    E -- Yes --> F[Exact dedup then\nsemantic dedup\nthreshold 0.90-0.94]
    E -- No --> G[Skip dedup\nsave latency]
    F --> H{Total chunk tokens\nvs retrieved_docs budget}
    G --> H
    H -- Within budget --> I[Include all verbatim\nno compression needed]
    H -- Over budget --> J{Chunks over budget\nare above relevance floor?}
    J -- Yes --> K[Selective compression\ncompress oversized chunks\nto max_chunk_tokens]
    J -- No --> L[Drop below-floor chunks\ninclude rest verbatim]
    K --> M[Citation formatting\nnumbered refs in context\nenables inline citation]
    I --> M
    L --> M
    M --> N[Assembled context\nready for prompt]

    style A fill:#4A90E2,color:#fff
    style B fill:#7B68EE,color:#fff
    style C fill:#6BCF7F,color:#fff
    style D fill:#6BCF7F,color:#fff
    style E fill:#7B68EE,color:#fff
    style F fill:#98D8C8,color:#333
    style G fill:#6BCF7F,color:#fff
    style H fill:#7B68EE,color:#fff
    style I fill:#6BCF7F,color:#fff
    style J fill:#7B68EE,color:#fff
    style K fill:#FFD93D,color:#333
    style L fill:#FFA07A,color:#333
    style M fill:#98D8C8,color:#333
    style N fill:#6BCF7F,color:#fff

How the Context Assembly Gap Compounds Prior Failure Modes

Context assembly is the final layer in a pipeline where every upstream failure mode reduces the signal available for assembly. But it also introduces failure modes that are independent of retrieval quality.

The Precision Gap from Part 4 directly sets the input quality for context assembly. A reranker that places the correct document at position 1 hands the assembler a strong input. A reranker with a high Precision Gap hands the assembler a weaker input. But even a high-quality reranked list is subject to positional degradation if the assembler concatenates in reranker order and places the best document at position 3 of 5.

The Evals Blind Spot from Part 5 extends to context assembly. RAGAS context precision measures whether retrieved chunks are relevant. It does not measure whether the assembled context is ordered to maximize LLM attention on the most relevant chunk. You can have context precision of 1.0 (all five retrieved chunks are relevant) and still produce wrong answers because the most relevant chunk is at position 3 of 5 and the LLM attends to positions 1 and 5.
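One way to close this blind spot is a position-sensitivity probe: hold the chunks fixed, move the known-good chunk through every slot, and record whether the answer stays correct. The sketch below assumes a hypothetical `answer_is_correct` callable that wraps your LLM call and grading logic; neither it nor the function names come from the original series.

```python
from typing import Callable


def position_probe_contexts(gold: str, distractors: list[str]) -> list[list[str]]:
    """Return one context per slot, with the gold chunk placed at that slot."""
    variants = []
    for position in range(len(distractors) + 1):
        variants.append(distractors[:position] + [gold] + distractors[position:])
    return variants


def correctness_by_position(
    gold: str,
    distractors: list[str],
    answer_is_correct: Callable[[list[str]], bool],
) -> dict[int, bool]:
    """Map each 1-based gold position to whether the answer was correct.

    answer_is_correct is a hypothetical hook: it receives the ordered chunks,
    runs generation, and grades the result against the expected answer.
    """
    contexts = position_probe_contexts(gold, distractors)
    return {
        position: answer_is_correct(context)
        for position, context in enumerate(contexts, start=1)
    }
```

Aggregated over a golden dataset, the per-position pass rate shows whether your model and prompt actually exhibit the U-shape, and therefore how much ordering is worth on your workload.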

The Orchestration Overhead from Part 6 is directly reduced by compression. Every token removed from context by selective compression is a token not paid for at inference time across every agent iteration. In agentic loops that retrieve three to four times, a 30% reduction in per-retrieval context size compounds across iterations.
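The arithmetic is easy to check. The sketch below assumes, as an illustration only, an agent loop that re-sends all previously retrieved context on every iteration; the function name and the 4,000-token figure are hypothetical.

```python
def loop_input_tokens(
    per_retrieval_tokens: int,
    iterations: int,
    accumulate: bool = True,
) -> int:
    """Total retrieved-context tokens sent as input across an agent loop.

    accumulate=True assumes iteration i re-sends the i retrievals gathered
    so far, giving 1 + 2 + ... + n copies. accumulate=False assumes each
    iteration sends only its own retrieval.
    """
    if accumulate:
        return per_retrieval_tokens * iterations * (iterations + 1) // 2
    return per_retrieval_tokens * iterations


# Hypothetical numbers: 4,000 tokens per retrieval, 4 iterations,
# and compression that cuts each retrieval by 30% (to 2,800 tokens).
baseline = loop_input_tokens(4_000, 4)      # 40,000 tokens
compressed = loop_input_tokens(2_800, 4)    # 28,000 tokens
saved = baseline - compressed               # 12,000 tokens per session
```

The percentage saved stays at 30% either way; what grows with iteration count is the absolute number of tokens saved, which is 12,000 here versus 4,800 if the loop did not accumulate context.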

The Staleness Gap from Part 7 interacts with deduplication. When an updated document and its prior version are both in the index during the staleness window, both may be retrieved and both may pass the relevance threshold. Deduplication will detect them as near-duplicates and keep the higher-scored one. That is the correct behavior for redundancy, but the reranker score measures relevance to the query, not recency, so the "higher-scored" version may be the stale one.
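If the index carries a last-modified timestamp, the survivor of a near-duplicate pair can be chosen by recency when both chunks come from the same document. This is a sketch under assumptions: the `updated_at` field and the `pick_survivor` helper are hypothetical and are not part of the `ScoredChunk` type used in this article.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VersionedChunk:
    text: str
    score: float
    doc_id: str
    updated_at: datetime  # Hypothetical metadata field from the index


def pick_survivor(a: VersionedChunk, b: VersionedChunk) -> VersionedChunk:
    """Choose which of two near-duplicate chunks to keep."""
    if a.doc_id == b.doc_id and a.updated_at != b.updated_at:
        # Two versions of the same document: the newer one wins,
        # even if the stale copy scored higher.
        return a if a.updated_at > b.updated_at else b
    # Different documents (or identical timestamps): fall back to score.
    return a if a.score >= b.score else b
```

In a greedy deduplication loop, a helper like this would replace the implicit "first seen wins" rule at the point where a similar pair is detected.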


The Context Assembly Checklist

Ordering:

  • Bookend ordering applied for k > 3 retrieved chunks: highest score at position 1, second highest at position k, remainder filling middle
  • For k <= 3: descending relevance order is sufficient

Deduplication:

  • Exact deduplication running before semantic deduplication (hash-based, fast)
  • Semantic deduplication threshold calibrated on your corpus: 0.90-0.94 recommended
  • Deduplication applied after reranking but before budget fitting
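Calibrating the semantic threshold on your corpus can be as simple as hand-labeling a few dozen chunk pairs and picking the candidate threshold with the best F1. The following is a minimal sketch; the function name `calibrate_threshold` and the input format are assumptions, and the similarity values would come from your own embedding model.

```python
def calibrate_threshold(
    labeled_pairs: list[tuple[float, bool]],
    candidates: list[float],
) -> tuple[float, float]:
    """Return (threshold, f1) for the candidate threshold with the best F1.

    labeled_pairs: (cosine_similarity, is_duplicate) for hand-labeled pairs.
    candidates: non-empty list of thresholds to try, e.g. [0.85, 0.90, 0.92].
    Ties go to the earlier candidate in the list.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")

    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        # A pair is predicted duplicate when similarity >= threshold.
        tp = sum(1 for sim, dup in labeled_pairs if sim >= threshold and dup)
        fp = sum(1 for sim, dup in labeled_pairs if sim >= threshold and not dup)
        fn = sum(1 for sim, dup in labeled_pairs if sim < threshold and dup)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

If false removals are costlier than leftover duplicates in your application, precision at the chosen threshold is worth checking separately, since F1 weights both error types equally.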

Budget allocation:

  • Explicit token budget defined per context zone: system prompt, retrieved docs, conversation, query, generation, buffer
  • Budget enforced programmatically using the model's tokenizer (not character count heuristics)
  • Chunk exclusion logged when budget is exceeded: high exclusion rate signals either tight budget or oversized chunks
  • For multi-turn conversations: conversation history compressed after N turns to prevent unbounded growth
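The simplest guard against unbounded history growth is a sliding window that keeps only the most recent turns that fit the conversation budget. Note that this is truncation, not the summarization-style compression the checklist item describes; it is shown as a baseline sketch. The function name and the word-count token estimate are assumptions.

```python
def trim_history(turns: list[str], token_budget: int) -> list[str]:
    """Keep the most recent turns whose combined size fits the budget.

    Walks backward from the newest turn and stops at the first turn that
    would overflow, so the kept turns are always a contiguous recent window.
    """
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        cost = len(turn.split())  # Assumption: words approximate tokens
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    return kept[::-1]  # Restore chronological order
```

A summarizing variant would replace the dropped prefix with a short LLM-written recap instead of discarding it, at the cost of one extra model call when the window slides.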

Compression:

  • Selective compression applied only when budget is tight - not as default operation
  • Chunks below relevance floor dropped before compression is considered (Part 4 threshold)
  • Compression uses a fast, cheap model (Haiku, mini) not the primary generation model
  • Compressed chunks flagged in metadata for monitoring: what fraction of context is compressed vs verbatim?

Citation formatting:

  • Numbered citation format applied: [1] Source: ...\n{text} enables inline citation in generation
  • Source document title or URL included in each citation block
  • System prompt instructs the model to cite inline using [1], [2] notation
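Once the model is cited by number, the answer can be checked mechanically: every `[n]` should refer to a source that was actually in the context. The sketch below is an assumed post-generation check, not part of the original pipeline; `check_citations` is a hypothetical name.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict:
    """Compare [n] markers in an answer against the numbered sources provided.

    Returns the cited numbers, any that fall outside 1..num_sources
    (likely hallucinated references), and any sources never cited.
    """
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    uncited = [n for n in range(1, num_sources + 1) if n not in cited]
    return {"cited": cited, "invalid": invalid, "uncited": uncited}
```

A non-empty `invalid` list is a strong signal to reject or regenerate the answer; a consistently long `uncited` list suggests the context carries chunks the model is not using.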

The Complete Eight-Layer Diagnostic Framework

Eight parts, eight named failure modes, one complete diagnostic vocabulary for every layer of the production RAG pipeline:

| Layer | Named Concept | Root Cause | Primary Metric |
| --- | --- | --- | --- |
| Retrieval strategy | The Retrieval Tax | Wrong backend for query type | nDCG@10 before/after routing |
| Chunking | Chunking Debt | Bad splits at ingestion | Context recall below 0.8 |
| Embedding | Semantic Compression Loss | Domain terms compressed to wrong proxies | Recall gap on domain eval set |
| Reranking | The Precision Gap | Bi-encoder rank error | nDCG@10 pre vs post reranker |
| Evaluation | The Evals Blind Spot | Answer quality measured; retrieval quality ignored | Context recall on golden dataset |
| Agent governance | The Orchestration Overhead | Loop cost with no ceiling | Avg iterations + tokens per session |
| Index freshness | The Staleness Gap | Index snapshot diverges from source | Stale fraction in metadata scan |
| Context assembly | The Context Assembly Gap | Retrieved signal destroyed by ordering, duplication, budget failure | Answer quality delta pre/post assembly |

The diagnostic framework is complete. Every production RAG failure traces to one of these eight layers. Each has a named failure mode, a primary measurement, and a specific fix in the corresponding part of this series.
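Two of the primary metrics in the table are nDCG@10 comparisons. For readers who want to compute it without a dependency, here is a minimal sketch using the common linear-gain form of DCG. It normalizes against the ideal ordering of the judged list only, which is an assumption that holds when relevance labels exist just for the retrieved candidates.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) for ranks 1..k."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of its ideal reordering.

    relevances: graded relevance labels in ranked order (position 1 first).
    Returns 0.0 when no item has positive relevance.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Running it on the same labeled queries before and after a routing or reranking change gives the before/after delta the table refers to.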

