Why Your RAG Pipeline Assembles Context Wrong

The retriever returned the right document. It was in the top-5. It contained the answer. The LLM did not use it.

It was at position 4 in a 5-document context, surrounded by tangentially relevant material on both sides. The LLM allocated most of its attention to the documents at positions 1 and 5. Position 4 received less attention - not because the content was wrong, not because the retrieval failed, but because of where the content was placed in the context window.

This is the most expensive failure mode nobody talks about in RAG engineering. You can get every upstream layer right - retrieval strategy (Part 1), chunking (Part 2), embedding (Part 3), reranking (Part 4), evaluation (Part 5), agentic governance (Part 6), index freshness (Part 7) - and still produce wrong answers because the step between retrieval and the LLM prompt was treated as a pass-through.

That step is context assembly: the deliberate architectural decisions about what the model sees, how much of it, in what order, with what structure. Andrej Karpathy named the broader practice in June 2025: "context engineering is the delicate art and science of filling the context window with just the right information for the next step." The LangChain State of Agent Engineering survey of 1,340 respondents found context engineering and managing context at scale were the top-named production challenges. Most documentation still treats it as concatenate-top-k-and-send.

The thesis of this article is specific: most RAG pipelines assemble context in the worst possible way for LLM attention - they concatenate retrieved chunks in retrieval order, add no structure, apply no budget policy, skip deduplication, and never compress. The result is a context window that contains the right information but is organized to make that information hard for the model to use.

Named Concept: The Context Assembly Gap

The Context Assembly Gap is the quality delta between what retrieval found and what the LLM actually processes. A retriever that surfaces the right document and a context assembler that places it at position 10 of 12 in a flat concatenation produces roughly the same output as a retriever that never found it - the LLM's positional attention bias redistributes that document's effective contribution toward zero.

The gap has four components that compound:

Positional degradation. LLMs exhibit a U-shaped attention curve across the context window: strong attention at the beginning and end, significantly weaker attention in the middle. The answer document at position 10 of 20 receives less attention than the same document at position 1 or 20. Concatenating retrieved chunks in retrieval order puts only the top-scored chunk at an edge; every other chunk, including the answer-bearing one whenever the reranker scored it second or third, lands in the middle.

Duplication noise. RAG systems commonly retrieve overlapping chunks: the same paragraph from two source documents, adjacent chunks from the same document that were retrieved separately, or contextual retrieval summaries that partially repeat the underlying content. Without deduplication, the model reads the same fact multiple times, consuming extra tokens and potentially over-weighting duplicated information in its generation.

Budget misallocation. The context window is a zero-sum budget shared between system prompt, retrieved documents, conversation history, and the current query. Without explicit allocation policy, the retrieved content expands to fill whatever space is available, crowding out conversation history or forcing the system prompt into a smaller share. In RAG workloads where input tokens can outnumber output tokens by something like 100:1, the context is where most of the bill accrues. Unmanaged budget allocation is unmanaged cost.

Compression failure. When retrieved context exceeds the available token budget, naive pipelines either truncate (drop everything past the limit) or retrieve fewer documents (sacrifice recall). Both are suboptimal. Selective compression - summarizing low-relevance chunks while preserving high-relevance ones verbatim - reduces token consumption while preserving the information the model actually needs.

The Wrong Way: Concatenate and Send

code

# Wrong way: the most common context assembly pattern in production RAG
def assemble_context_naive(query: str, vectorstore, llm) -> str:
    """
    What most RAG tutorials show, and what most production systems ship with.

    Problems:
    1. Ordering: chunks returned in retrieval score order, not LLM-attention order.
       The top-scored chunk goes first. But if there are 5 chunks, positions 2-4
       sit in the middle of the context window - the attention dead zone.

    2. No deduplication: if two source documents contain the same paragraph,
       both are retrieved and both are passed to the LLM. The model reads
       the same sentence twice, paying token cost twice, and potentially
       amplifying that content relative to non-duplicated information.

    3. No budget policy: the retrieved context is concatenated until complete.
       If the system prompt is long, or conversation history is long,
       or the 5 retrieved chunks are each 500 tokens, the total context
       may exceed the budget the system was designed for. No enforcement.

    4. No compression: if the budget is exceeded, truncation drops everything
       past the limit. The chunk at position 5 may be partially or fully cut.
       The model has no idea the context was truncated.
    """
    docs = vectorstore.similarity_search(query, k=5)

    # Concatenated in retrieval order, no processing
    context = "\n\n".join([doc.page_content for doc in docs])

    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.invoke(prompt).content

The Right Way: A Four-Stage Context Assembly Pipeline

Context assembly is not a single operation. It is a pipeline with four distinct stages, each addressing a different component of the Context Assembly Gap.

Stage 1: Ordering - Put the Most Relevant Document Where the Model Will Read It

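Liu et al. (TACL 2024, arXiv:2307.03172) measured a U-shaped performance curve on the models they tested: accuracy is highest when the relevant document sits at the start or end of the context and lowest when it sits in the middle. In their multi-document question answering experiments, GPT-3.5-Turbo accuracy dropped by more than 20% when the answer document sat mid-context rather than at an edge. This is not a bug in one model. The paper observed it across open and closed models, including extended-context variants, and follow-up work (Hsieh et al., arXiv:2406.16008) links it to a positional attention bias that is often attributed in part to positional encodings such as Rotary Positional Embedding (RoPE).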
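
One way to check whether the effect shows up in your own stack is a position sweep: hold the documents fixed, move the answer document through every slot, and record whether the model still answers correctly. The sketch below assumes a hypothetical `ask(context, question)` callable that wraps whatever LLM client you use, and a crude substring check for correctness. Both are illustrative stand-ins, not part of any library.

```python
from typing import Callable


def position_sweep(answer_doc: str, distractors: list[str]) -> list[list[str]]:
    """Return one document ordering per possible slot of answer_doc."""
    return [
        distractors[:slot] + [answer_doc] + distractors[slot:]
        for slot in range(len(distractors) + 1)
    ]


def accuracy_by_position(
    answer_doc: str,
    distractors: list[str],
    question: str,
    expected: str,
    ask: Callable[[str, str], str],  # hypothetical wrapper: (context, question) -> reply
) -> dict[int, bool]:
    """Map the 1-based position of the answer document to whether the reply contained `expected`."""
    results: dict[int, bool] = {}
    for position, docs in enumerate(position_sweep(answer_doc, distractors), start=1):
        context = "\n\n".join(docs)
        reply = ask(context, question)
        # Substring match is a crude correctness check; swap in your own grader.
        results[position] = expected.lower() in reply.lower()
    return results
```

A single question says little. Averaging the per-position results over a few dozen question sets gives a curve you can compare before and after changing the ordering policy.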

The fix is to bookend: place the highest-scored document first, the second-highest last, and fill the middle with supporting context. If you are retrieving five documents, positions 1 and 5 are where the model attends most strongly.

code

from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float       # Cross-encoder reranker score from Part 4
    doc_id: str
    source: str


def bookend_ordering(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Place highest-scored chunks at first and last positions.
    Fill middle positions with remaining chunks in descending score order.

    This directly counteracts the U-shaped attention bias documented by
    Liu et al. (TACL 2024): the model attends most strongly to beginning
    and end. Putting your best evidence there maximizes the probability
    the model uses it.

    For k=5 chunks:
    Input order (reranker score desc): [A, B, C, D, E]
    Bookend order:                     [A, C, D, E, B]
    A (best) at position 1, B (second best) at position 5.
    C, D, E fill the middle.
    """
    if not chunks:
        return []
    if len(chunks) == 1:
        return chunks

    sorted_chunks = sorted(chunks, key=lambda x: x.score, reverse=True)

    best = sorted_chunks[0]
    second_best = sorted_chunks[1]
    middle = sorted_chunks[2:]

    # Bookend: best first, second-best last, rest in middle
    return [best] + middle + [second_best]


def ordering_for_size(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Alternative: descending relevance for small sets (k <= 3),
    bookend for larger sets (k > 3) where mid-context degradation
    becomes significant.
    """
    if len(chunks) <= 3:
        return sorted(chunks, key=lambda x: x.score, reverse=True)
    return bookend_ordering(chunks)
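
A variant worth comparing against `bookend_ordering` alternates chunks between the front and the back, so the weakest chunks end up in the center instead of directly before the second-best one. This is a sketch of that alternative, not the ordering defined above. For self-containment it takes plain `(text, score)` tuples, which is an assumption of the sketch, rather than `ScoredChunk`.

```python
def edges_first_ordering(chunks: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Alternate ranked chunks between front and back so relevance falls toward the center."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: list[tuple[str, float]] = []
    back: list[tuple[str, float]] = []
    for rank, chunk in enumerate(ranked):
        # Even ranks (0, 2, 4, ...) fill from the front, odd ranks from the back.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # `back` was filled best-first, so reverse it to put rank 1 in the final slot.
    return front + back[::-1]


chunks = [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6), ("E", 0.5)]
print([name for name, _ in edges_first_ordering(chunks)])
# prints ['A', 'C', 'E', 'D', 'B']
```

Which of the two orderings works better is an empirical question for your model and corpus; the position sweep style of evaluation is the way to settle it.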

Stage 2: Deduplication - Remove Redundant Information Before It Reaches the LLM

Duplication in retrieved context wastes tokens and distorts generation by over-weighting facts that happen to appear in multiple sources. Two deduplication strategies cover the production cases.

code

import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


def exact_dedup(chunks: list[ScoredChunk]) -> list[ScoredChunk]:
    """
    Hash-based exact deduplication.
    Removes verbatim duplicates: same text from two different source documents.
    Preserves the higher-scored copy when duplicates are found.

    Cost: O(n log n) for the sort plus one hash per chunk - fast, runs before semantic dedup.
    """
    seen_hashes: set[str] = set()
    deduped: list[ScoredChunk] = []

    # Sort by score desc so we keep the higher-scored copy on collision
    for chunk in sorted(chunks, key=lambda x: x.score, reverse=True):
        content_hash = hashlib.md5(chunk.text.strip().encode()).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduped.append(chunk)

    return deduped


def semantic_dedup(
    chunks: list[ScoredChunk],
    embed_model: SentenceTransformer,
    similarity_threshold: float = 0.92,
) -> list[ScoredChunk]:
    """
    Embedding-based semantic deduplication.
    Removes near-duplicate chunks: same content paraphrased or slightly reformatted.

    Threshold calibration:
      0.95+: only near-verbatim duplicates removed
      0.90-0.94: removes close paraphrases (recommended range)
      0.85-0.89: aggressive; may remove genuinely distinct chunks on the same topic

    Cost: one batched embedding pass over all chunks plus O(n^2) similarity
    checks. Run exact dedup first to reduce n.
    """
    if len(chunks) <= 1:
        return chunks

    texts = [c.text for c in chunks]
    # Normalized embeddings make the dot product equal to cosine similarity
    embeddings = embed_model.encode(texts, normalize_embeddings=True)

    kept_indices: list[int] = []
    dropped: set[int] = set()

    # Greedy: keep highest-scored; drop anything too similar to what we kept
    sorted_indices = sorted(
        range(len(chunks)),
        key=lambda i: chunks[i].score,
        reverse=True,
    )

    for idx in sorted_indices:
        if idx in dropped:
            continue

        kept_indices.append(idx)

        # Mark similar chunks as dropped
        for other_idx in sorted_indices:
            if other_idx == idx or other_idx in dropped:
                continue
            sim = float(np.dot(embeddings[idx], embeddings[other_idx]))
            if sim >= similarity_threshold:
                dropped.add(other_idx)

    # Return survivors in their original input order
    return [chunks[i] for i in sorted(kept_indices)]


def dedup_pipeline(
    chunks: list[ScoredChunk],
    embed_model: SentenceTransformer,
    semantic_threshold: float = 0.92,
) -> list[ScoredChunk]:
    """
    Run exact dedup first (fast), then semantic dedup on the remainder.
    """
    after_exact = exact_dedup(chunks)
    return semantic_dedup(after_exact, embed_model, semantic_threshold)

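Whole-chunk deduplication does not cover one case named earlier: adjacent chunks from the same document that share only an overlap window. Neither chunk is a duplicate of the other, but the repeated span is still paid for twice. A minimal sketch of merging such a pair, assuming character-level overlap and a hypothetical `min_overlap` cutoff that guards against merging on short coincidental matches:

```python
from typing import Optional


def merge_overlap(first: str, second: str, min_overlap: int = 20) -> Optional[str]:
    """
    If the end of `first` repeats at the start of `second`, return the merged
    text with the shared span included once. Return None when no overlap of at
    least `min_overlap` characters exists.
    """
    max_len = min(len(first), len(second))
    # Try the longest candidate overlap first so the largest shared span wins.
    for size in range(max_len, min_overlap - 1, -1):
        if first.endswith(second[:size]):
            return first + second[size:]
    return None


a = "The quick brown fox jumps over the lazy dog."
b = "jumps over the lazy dog. Then it sleeps."
print(merge_overlap(a, b))
# prints The quick brown fox jumps over the lazy dog. Then it sleeps.
```

The scan is quadratic in the overlap length, which is acceptable for chunk-sized strings; only pairs that share a `doc_id` and are known to be adjacent are worth testing.
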
Stage 3: Budget Allocation - Make Context a Governed Resource

The context window is a zero-sum budget. Every token allocated to retrieved documents is a token not available for conversation history, tool outputs, or system instructions. Without an explicit allocation policy, dynamic content expands until something gets truncated - usually the content that arrives last in the assembly order.

code

import tiktoken
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """
    Explicit token budget for each context zone.
    The total must not exceed the model's context window; generation_tokens
    reserves the room for output.

    Recommended starting allocation for a 128K-context model such as GPT-4o
    (the tokenizer default used below):
      system_prompt:    4,000 tokens  (stable instructions)
      retrieved_docs:  32,000 tokens  (retrieved context)
      conversation:    8,000 tokens   (rolling history)
      query:           1,000 tokens   (current query)
      generation:      4,000 tokens   (output budget)
      buffer:          1,000 tokens   (safety margin)

    Adjust per use case:
      - High-context analytical queries: increase retrieved_docs, reduce conversation
      - Customer support chat: increase conversation, reduce retrieved_docs
      - Single-turn Q&A: set conversation to 0
    """
    system_prompt_tokens: int = 4_000
    retrieved_docs_tokens: int = 32_000
    conversation_tokens: int = 8_000
    query_tokens: int = 1_000
    generation_tokens: int = 4_000
    buffer_tokens: int = 1_000

    @property
    def total(self) -> int:
        return (
            self.system_prompt_tokens
            + self.retrieved_docs_tokens
            + self.conversation_tokens
            + self.query_tokens
            + self.generation_tokens
            + self.buffer_tokens
        )


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """
    Count tokens using the model's own tokenizer.

    tiktoken only covers OpenAI models; for other providers, use that
    provider's token counter instead.
    """
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def fit_chunks_to_budget(
    chunks: list[ScoredChunk],
    token_budget: int,
    model: str = "gpt-4o",
) -> list[ScoredChunk]:
    """
    Select chunks that fit within the retrieved_docs token budget.

    Strategy: include chunks greedily by score until budget is exhausted.
    A chunk that does not fully fit is dropped, not truncated mid-sentence.
    Truncating mid-chunk is worse than excluding the chunk entirely -
    the model receives an incomplete semantic unit.

    Log when chunks are excluded: this is a signal that either the budget
    is too tight or the retrieval is returning too many large chunks.
    """
    fitted: list[ScoredChunk] = []
    tokens_used = 0
    excluded = []

    # Highest score first, so the chunks that get excluded are the least relevant.
    # Run this before bookend ordering; otherwise the second-best chunk, which
    # bookending places last, would be the first one cut.
    for chunk in sorted(chunks, key=lambda x: x.score, reverse=True):
        chunk_tokens = count_tokens(chunk.text, model)

        if tokens_used + chunk_tokens <= token_budget:
            fitted.append(chunk)
            tokens_used += chunk_tokens
        else:
            excluded.append((chunk.doc_id, chunk_tokens))

    if excluded:
        print(
            f"Budget: excluded {len(excluded)} chunks "
            f"({sum(t for _, t in excluded)} tokens) from context. "
            f"Used {tokens_used}/{token_budget} tokens."
        )

    return fitted
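
A budget only helps if it is checked against the real context window before any request is built. The sketch below is one way to do that with plain integers. The zone names mirror `ContextBudget`, and the 128,000-token window is an assumed example value, so substitute your model's documented limit.

```python
def budget_headroom(zones: dict[str, int], context_window: int) -> int:
    """Return unallocated tokens, or raise if the zones oversubscribe the window."""
    negative = [name for name, tokens in zones.items() if tokens < 0]
    if negative:
        raise ValueError(f"Negative allocation for zones: {negative}")
    total = sum(zones.values())
    if total > context_window:
        raise ValueError(
            f"Budget of {total} tokens exceeds context window of {context_window}"
        )
    return context_window - total


zones = {
    "system_prompt": 4_000,
    "retrieved_docs": 32_000,
    "conversation": 8_000,
    "query": 1_000,
    "generation": 4_000,
    "buffer": 1_000,
}
print(budget_headroom(zones, context_window=128_000))  # prints 78000
```

Running this at startup, not per request, turns a misconfigured budget into a deployment failure instead of a silent truncation in production.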

Stage 4: Compression - Preserve Signal When Budget Is Tight

When the token budget is genuinely insufficient to include all relevant chunks verbatim, selective compression preserves the information the model needs without the token cost of full inclusion.

code

from anthropic import Anthropic

client = Anthropic()


def compress_chunk(chunk: ScoredChunk, target_tokens: int) -> ScoredChunk:
    """
    Compress a chunk to approximately target_tokens using a fast LLM.

    Use for chunks that are relevant but exceed the per-chunk budget.
    Preserve the factual content; strip narrative structure and examples.

    When to compress vs drop:
    - Compress: chunk score above relevance floor, but chunk too large
    - Drop: chunk score below relevance floor (established in Part 4)

    Note: compression introduces an LLM call per chunk.
    Use only when budget is genuinely tight, not as a default operation.
    """
    prompt = f"""Compress the following text to approximately {target_tokens} tokens.
Preserve all specific facts, numbers, dates, names, and technical details.
Remove examples, analogies, and repetitive explanations.
Output only the compressed text, nothing else.

Text to compress:
{chunk.text}"""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast model for compression
        max_tokens=target_tokens + 100,     # Slight headroom
        messages=[{"role": "user", "content": prompt}]
    )

    compressed_text = response.content[0].text
    return ScoredChunk(
        text=compressed_text,
        score=chunk.score,
        doc_id=chunk.doc_id,
        source=chunk.source,
    )


def selective_compress(
    chunks: list[ScoredChunk],
    token_budget: int,
    relevance_floor: float = 0.0,
    max_chunk_tokens: int = 2_000,
    model: str = "gpt-4o",
) -> list[ScoredChunk]:
    """
    Selectively compress chunks that exceed max_chunk_tokens
    while leaving high-relevance, appropriately-sized chunks untouched.

    Order of operations:
    1. Drop chunks below relevance_floor entirely
    2. Compress chunks above max_chunk_tokens to max_chunk_tokens
    3. Return compressed + untouched chunks within budget
    """
    # Highest score first, so the best chunks claim the budget before weaker ones
    filtered = sorted(
        (c for c in chunks if c.score >= relevance_floor),
        key=lambda x: x.score,
        reverse=True,
    )

    result: list[ScoredChunk] = []
    tokens_used = 0

    for chunk in filtered:
        chunk_tokens = count_tokens(chunk.text, model)

        if chunk_tokens > max_chunk_tokens:
            # Compress: too large to include verbatim
            compressed = compress_chunk(chunk, target_tokens=max_chunk_tokens)
            compressed_tokens = count_tokens(compressed.text, model)

            if tokens_used + compressed_tokens <= token_budget:
                result.append(compressed)
                tokens_used += compressed_tokens
        else:
            if tokens_used + chunk_tokens <= token_budget:
                result.append(chunk)
                tokens_used += chunk_tokens

    return result
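
When an LLM call per chunk is too slow or too expensive, a purely extractive pass is a cheaper fallback. The sketch below is a swapped-in technique, not the LLM-based compression above: it keeps whole sentences that share words with the query until a word limit is reached. The regex sentence splitter and the word-count budget are simplifying assumptions; a production version would use a real tokenizer and filter stopwords.

```python
import re


def extractive_compress(text: str, query: str, max_words: int) -> str:
    """Keep the sentences with the most query-term overlap, in original order."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    scored = []
    for index, sentence in enumerate(sentences):
        terms = set(re.findall(r"\w+", sentence.lower()))
        scored.append((len(terms & query_terms), index, sentence))

    kept: set[int] = set()
    words_used = 0
    # Highest overlap first; ties resolved by earlier position in the text.
    for overlap, index, sentence in sorted(scored, key=lambda t: (-t[0], t[1])):
        sentence_words = len(sentence.split())
        if overlap == 0 or words_used + sentence_words > max_words:
            continue
        kept.add(index)
        words_used += sentence_words

    # Re-emit in document order so the compressed text still reads coherently.
    return " ".join(s for i, s in enumerate(sentences) if i in kept)


text = (
    "The refund window is 30 days. Our office has a nice view. "
    "Refunds are issued to the original payment method."
)
print(extractive_compress(text, "What is the refund window?", max_words=10))
# prints The refund window is 30 days.
```

Because sentences are kept verbatim, this variant cannot introduce facts that were not in the source, which is a useful property when compressed chunks are later cited.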

The Complete Context Assembly Pipeline

All four stages compose into a single assembly function called after retrieval and reranking.

code

def assemble_context(
    query: str,
    reranked_chunks: list[ScoredChunk],
    budget: ContextBudget,
    embed_model: SentenceTransformer,
    citation_format: bool = True,
    model: str = "gpt-4o",
) -> tuple[str, list[str]]:
    """
    Four-stage context assembly pipeline.

    Returns (assembled_context_string, citation_list)

    Execution order (ordering runs last so it operates on the final chunk set):
    1. Deduplication - remove redundant information
    2. Budget fitting - select chunks within token budget
    3. Compression  - compress oversized chunks if needed
    4. Ordering     - bookend for LLM attention optimization

    Call after reranking (Part 4), before prompt construction.
    `query` is accepted for interface symmetry and is not used here.
    """
    if not reranked_chunks:
        return "", []

    # Dedup
    deduped = dedup_pipeline(reranked_chunks, embed_model)

    # Fit to budget with selective compression
    compressed = selective_compress(
        deduped,
        token_budget=budget.retrieved_docs_tokens,
        model=model,
    )

    # Reorder for attention optimization
    ordered = ordering_for_size(compressed)

    # Format with citations if requested
    if citation_format:
        formatted_chunks = []
        citations = []

        for i, chunk in enumerate(ordered, 1):
            formatted_chunks.append(
                f"[{i}] Source: {chunk.source}\n{chunk.text}"
            )
            citations.append(chunk.source)

        context = "\n\n---\n\n".join(formatted_chunks)
        return context, citations
    else:
        context = "\n\n".join([c.text for c in ordered])
        return context, [c.source for c in ordered]


def rag_with_context_assembly(
    query: str,
    retriever,
    reranker,
    embed_model: SentenceTransformer,
    llm,
    budget: ContextBudget,
) -> dict:
    """
    Full RAG pipeline with context assembly as an explicit stage.

    Parts 1-4 (retrieval, chunking, embedding, reranking) produce the candidates.
    Context assembly (Part 8) determines what the LLM actually reads.
    """
    # Parts 1-4: retrieve and rerank
    raw_docs = retriever.invoke(query)

    pairs = [(query, doc.page_content) for doc in raw_docs]
    rerank_scores = reranker.predict(pairs)

    scored_chunks = [
        ScoredChunk(
            text=doc.page_content,
            score=float(score),
            doc_id=doc.metadata.get("doc_id", f"doc_{i}"),
            source=doc.metadata.get("source", "unknown"),
        )
        for i, (doc, score) in enumerate(zip(raw_docs, rerank_scores))
    ]

    # Part 8: context assembly
    context, citations = assemble_context(
        query=query,
        reranked_chunks=scored_chunks,
        budget=budget,
        embed_model=embed_model,
    )

    if not context:
        return {"answer": "No relevant content found.", "citations": []}

    prompt = f"""Use the following sources to answer the question.
Cite sources inline using [1], [2] notation.

{context}

Question: {query}
Answer:"""

    answer = llm.invoke(prompt).content

    return {
        "answer": answer,
        "citations": citations,
        "chunks_retrieved": len(raw_docs),
        # One citation per chunk that survived dedup, budget fitting, and compression
        "chunks_after_dedup_and_budget": len(citations),
    }

The Context Assembly Decision Guide

mermaid

flowchart TD
    A[Reranked chunks ready] --> B{k greater than 3?\nMultiple chunks to assemble}
    B -- No --> C[Relevance-descending order\nno bookending needed]
    B -- Yes --> D[Bookend ordering\nbest first, second-best last\nremaining chunks in the middle]
    C --> E{Semantic duplicates\npresent in candidates?}
    D --> E
    E -- Yes --> F[Exact dedup then\nsemantic dedup\nthreshold 0.90-0.94]
    E -- No --> G[Skip dedup\nsave latency]
    F --> H{Total chunk tokens\nvs retrieved_docs budget}
    G --> H
    H -- Within budget --> I[Include all verbatim\nno compression needed]
    H -- Over budget --> J{Chunks over budget\nare above relevance floor?}
    J -- Yes --> K[Selective compression\ncompress oversized chunks\nto max_chunk_tokens]
    J -- No --> L[Drop below-floor chunks\ninclude rest verbatim]
    K --> M[Citation formatting\nnumbered refs in context\nenables inline citation]
    I --> M
    L --> M
    M --> N[Assembled context\nready for prompt]

    style A fill:#4A90E2,color:#fff
    style B fill:#7B68EE,color:#fff
    style C fill:#6BCF7F,color:#fff
    style D fill:#6BCF7F,color:#fff
    style E fill:#7B68EE,color:#fff
    style F fill:#98D8C8,color:#333
    style G fill:#6BCF7F,color:#fff
    style H fill:#7B68EE,color:#fff
    style I fill:#6BCF7F,color:#fff
    style J fill:#7B68EE,color:#fff
    style K fill:#FFD93D,color:#333
    style L fill:#FFA07A,color:#333
    style M fill:#98D8C8,color:#333
    style N fill:#6BCF7F,color:#fff

How the Context Assembly Gap Compounds Prior Failure Modes

Context assembly is the final layer in a pipeline where every upstream failure mode reduces the signal available for assembly. But it also introduces failure modes that are independent of retrieval quality.

The Precision Gap from Part 4 directly sets the input quality for context assembly. A reranker that places the correct document at position 1 hands the assembler a strong input. A reranker with a high Precision Gap hands the assembler a weaker input. But even a high-quality reranked list is subject to positional degradation: if the answer-bearing document is ranked third and the assembler concatenates in reranker order, it sits at position 3 of 5, the weakest slot in the window.

The Evals Blind Spot from Part 5 extends to context assembly. RAGAS context precision measures whether retrieved chunks are relevant. It does not measure whether the assembled context is ordered to maximize LLM attention on the most relevant chunk. You can have context precision of 1.0 (all five retrieved chunks are relevant) and still produce wrong answers because the most relevant chunk is at position 3 of 5 and the LLM attends to positions 1 and 5.
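
A position-aware check can sit next to context precision in the eval suite. The sketch below is a hypothetical diagnostic, not a RAGAS metric: given reranker scores listed in the order the chunks appear in the assembled context, it reports how often the top-scored chunk landed in the first or last slot.

```python
def best_chunk_position(scores_in_context_order: list[float]) -> int:
    """1-based position of the highest-scored chunk in the assembled context."""
    best_index = max(
        range(len(scores_in_context_order)),
        key=lambda i: scores_in_context_order[i],
    )
    return best_index + 1


def edge_placement_rate(contexts: list[list[float]]) -> float:
    """Fraction of non-empty contexts whose best chunk sits at position 1 or k."""
    non_empty = [scores for scores in contexts if scores]
    if not non_empty:
        return 0.0
    hits = sum(
        1
        for scores in non_empty
        if best_chunk_position(scores) in (1, len(scores))
    )
    return hits / len(non_empty)


contexts = [
    [0.9, 0.5, 0.4, 0.3, 0.8],  # best chunk first: edge
    [0.2, 0.4, 0.9, 0.3, 0.1],  # best chunk at position 3 of 5: middle
]
print(edge_placement_rate(contexts))  # prints 0.5
```

A rate well below 1.0 on a pipeline with high context precision points at assembly, not retrieval, as the place to look.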

The Orchestration Overhead from Part 6 is directly reduced by compression. Every token removed from context by selective compression is a token not paid for at inference time across every agent iteration. In agentic loops that retrieve three to four times, a 30% reduction in per-retrieval context size compounds across iterations.
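
A rough worked example of the compounding, under the assumption (not taken from Part 6) that each iteration re-sends a fixed base prompt plus all context retrieved so far. The token counts are made-up round numbers chosen for illustration.

```python
def loop_input_tokens(base_tokens: int, tokens_per_retrieval: int, iterations: int) -> int:
    """Total input tokens billed when iteration i carries i retrievals' worth of context."""
    return sum(
        base_tokens + i * tokens_per_retrieval
        for i in range(1, iterations + 1)
    )


before = loop_input_tokens(base_tokens=5_000, tokens_per_retrieval=10_000, iterations=4)
# Same loop with each retrieval compressed by 30%.
after = loop_input_tokens(base_tokens=5_000, tokens_per_retrieval=7_000, iterations=4)
print(before, after, before - after)  # prints 120000 90000 30000
```

In this example one 3,000-token saving per retrieval becomes a 30,000-token saving per session, because every retrieval is re-read by each later iteration. The total falls by 25% and not 30% because the base prompt is not compressed.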

The Staleness Gap from Part 7 interacts with deduplication. When an updated document and its prior version are both in the index during the staleness window, both may be retrieved and both may pass the relevance threshold. Deduplication will detect them as near-duplicates and keep the higher-scored one. That is the intended behavior, but the higher-scored version may be the stale one: relevance scores measure similarity to the query and carry no notion of recency.
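
One mitigation is to resolve version collisions on metadata before score-based deduplication runs. The sketch assumes each chunk carries a hypothetical `updated_at` timestamp and `chunk_index`, and that old and new versions of a document share a `doc_id`. Only `doc_id` exists on `ScoredChunk` as defined earlier; the other fields are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VersionedChunk:
    text: str
    score: float
    doc_id: str
    chunk_index: int      # hypothetical: position of the chunk within its document
    updated_at: datetime  # hypothetical: when this version was ingested


def keep_newest_version(chunks: list[VersionedChunk]) -> list[VersionedChunk]:
    """For each (doc_id, chunk_index), keep only the most recently updated chunk."""
    newest: dict[tuple[str, int], VersionedChunk] = {}
    for chunk in chunks:
        key = (chunk.doc_id, chunk.chunk_index)
        current = newest.get(key)
        if current is None or chunk.updated_at > current.updated_at:
            newest[key] = chunk
    # Hand the survivors back in descending score order for the next stage.
    return sorted(newest.values(), key=lambda c: c.score, reverse=True)
```

Run before `dedup_pipeline`, this makes recency win over score for known versions of the same chunk, while unrelated near-duplicates are still resolved by score.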

The Context Assembly Checklist

Ordering:

Bookend ordering applied for k > 3 retrieved chunks: highest score at position 1, second highest at position k, remainder filling middle
For k <= 3: descending relevance order is sufficient

Deduplication:

Exact deduplication running before semantic deduplication (hash-based, fast)
Semantic deduplication threshold calibrated on your corpus: 0.90-0.94 recommended
Deduplication applied after reranking but before budget fitting

Budget allocation:

Explicit token budget defined per context zone: system prompt, retrieved docs, conversation, query, generation, buffer
Budget enforced programmatically using the model's tokenizer (not character count heuristics)
Chunk exclusion logged when budget is exceeded: high exclusion rate signals either tight budget or oversized chunks
For multi-turn conversations: conversation history compressed after N turns to prevent unbounded growth
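
For the conversation zone, the simplest enforcement is to drop the oldest turns once the budget is exceeded. The sketch below implements that trimming policy, which is a simpler stand-in for the compression the last item describes; summarizing the dropped turns would be the next step. `count_tokens` is passed in as a parameter (an assumption of this sketch) so the real tokenizer can be supplied.

```python
from typing import Callable


def trim_history(
    turns: list[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the most recent turns that fit in token_budget; oldest turns are dropped first."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > token_budget:
            break  # Stop at the first turn that does not fit, keeping history contiguous.
        kept.append(turn)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept
```

Stopping at the first turn that does not fit, and not skipping it to admit an older one, keeps the retained history a contiguous suffix of the conversation.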

Compression:

Selective compression applied only when budget is tight - not as default operation
Chunks below relevance floor dropped before compression is considered (Part 4 threshold)
Compression uses a fast, cheap model (Haiku, mini) not the primary generation model
Compressed chunks flagged in metadata for monitoring: what fraction of context is compressed vs verbatim?

Citation formatting:

Numbered citation format applied: [1] Source: ...\n{text} enables inline citation in generation
Source document title or URL included in each citation block
System prompt instructs the model to cite inline using [1], [2] notation
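
Numbered citations are also cheap to verify after generation. A minimal sketch, assuming the `[n]` notation above and sources numbered from 1: it extracts the cited indices from an answer and reports any that do not correspond to a source in the assembled context.

```python
import re


def cited_indices(answer: str) -> set[int]:
    """Collect every [n] style citation number that appears in the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited numbers that fall outside 1..num_sources, sorted ascending."""
    return sorted(i for i in cited_indices(answer) if not 1 <= i <= num_sources)


answer = "Refunds take 30 days [1][3]. See also [7]."
print(invalid_citations(answer, num_sources=3))  # prints [7]
```

A non-empty result means the model cited a source it was never shown, which is worth logging alongside the answer.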

The Complete Eight-Layer Diagnostic Framework

Eight parts, eight named failure modes, one complete diagnostic vocabulary for every layer of the production RAG pipeline:

Layer	Named Concept	Root Cause	Primary Metric
Retrieval strategy	The Retrieval Tax	Wrong backend for query type	nDCG@10 before/after routing
Chunking	Chunking Debt	Bad splits at ingestion	Context recall below 0.8
Embedding	Semantic Compression Loss	Domain terms compressed to wrong proxies	Recall gap on domain eval set
Reranking	The Precision Gap	Bi-encoder rank error	nDCG@10 pre vs post reranker
Evaluation	The Evals Blind Spot	Answer quality measured; retrieval quality ignored	Context recall on golden dataset
Agent governance	The Orchestration Overhead	Loop cost with no ceiling	Avg iterations + tokens per session
Index freshness	The Staleness Gap	Index snapshot diverges from source	Stale fraction in metadata scan
Context assembly	The Context Assembly Gap	Retrieved signal destroyed by ordering, duplication, budget failure	Answer quality delta pre/post assembly

The diagnostic framework is complete. Every production RAG failure traces to one of these eight layers. Each has a named failure mode, a primary measurement, and a specific fix in the corresponding part of this series.

References

Liu, N.F., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. arXiv:2307.03172. https://arxiv.org/abs/2307.03172
Hsieh, C., et al. (2024). Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. arXiv:2406.16008. https://arxiv.org/abs/2406.16008
Karpathy, A. (2025). Context Engineering. X/Twitter. https://x.com/karpathy/status/1937902205765607626
Jiang, H., et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. arXiv:2310.05736. https://arxiv.org/abs/2310.05736
Towards Data Science. (2026). RAG Isn't Enough - I Built the Missing Context Layer That Makes LLM Systems Work. https://towardsdatascience.com/rag-isnt-enough-i-built-the-missing-context-layer-that-makes-llm-systems-work/
Roadie. (2026). Why Conflating RAG with Context Engineering Costs You in Production. https://roadie.io/blog/rag-vs-context-engineering-production/
Redis. (2026). Context Window Management for LLM Apps. https://redis.io/blog/context-window-management-llm-apps-developer-guide/
Redis. (2026). Context Window Overflow in 2026. https://redis.io/blog/context-window-overflow/
Atlan. (2026). LLM Context Window Limitations in 2026. https://atlan.com/know/llm-context-window-limitations/
Harness Engineering Academy. (2026). Context Engineering: The Key Skill Every AI Developer Needs in 2026. https://harnessengineering.academy/blog/context-engineering-the-key-skill-every-ai-developer-needs-in-2026/
Logic Inc. (2026). Context Engineering Guide for AI Teams. https://logic.inc/resources/context-engineering-guide-for-ai-teams
RAGFlow. (2025). From RAG to Context: A 2025 Year-End Review of RAG. https://ragflow.io/blog/rag-review-2025-from-rag-to-context
Morph LLM. (2026). Lost in the Middle LLM: The U-Shaped Attention Problem Explained. https://www.morphllm.com/lost-in-the-middle-llm
