RAG Engineering in Production · Part 4

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Why Your Reranker Is the Last Line You Forgot to Build

Retrieval gets you recall. Reranking gets you precision. Skipping it means your LLM reads the wrong documents with complete confidence - and you will not know until production.

#rag #reranking #cross-encoder #colbert #precision #retrieval #production-ai #llm-infrastructure

Your hybrid retrieval returns 50 candidates. You pass the top 5 to the LLM. The answer is confident, specific, and wrong in the exact way that damages trust: it cites the right topic from the wrong document, or the right document from the wrong time period, or a clause that was superseded six months ago while the one that would have answered correctly sits two positions lower in the ranking.

The document that would have answered correctly sat at position 7. Your bi-encoder ranked it there because all it measures is approximate semantic similarity between independently encoded vectors. By that measure, the documents above it were close to the answer. They were not the answer.

This is the precision problem that every RAG system has and most teams do not name until a production incident forces them to. Bi-encoders and BM25 are recall engines - they are designed and optimized to find documents that are probably relevant, quickly, across millions of candidates. They are not precision engines. They were never trained to score the interaction between query tokens and document tokens jointly. They were trained to encode each side independently and compute a vector distance.

Reranking is what converts recall into precision. It takes the broad candidate set your first-stage retriever produces and re-scores each candidate by running a model that sees both the query and the document simultaneously - joint attention, full token interaction. The model that does that is slower by design. You cannot run it against millions of documents. But you can run it against 50-100 candidates, which is exactly the right place to run it.
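
To make that architectural difference concrete, here is a minimal sketch of the two scoring paths side by side. It uses two public sentence-transformers models as stand-ins; the specific model ids and the example texts are illustrative, not a recommendation.

code
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "What SLA applies when a P1 incident affects more than 500 users?"
doc = "P1 incidents affecting more than 500 users fall under the Tier 1 SLA: 15-minute response."

# Bi-encoder path: encode each side independently, compare vectors.
# The model never sees the query and the document in the same forward pass.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q_vec = bi_encoder.encode(query, convert_to_tensor=True)
d_vec = bi_encoder.encode(doc, convert_to_tensor=True)
bi_score = util.cos_sim(q_vec, d_vec).item()  # approximate relevance via vector distance

# Cross-encoder path: one forward pass over the concatenated pair.
# Every query token attends to every document token - joint scoring.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"bi-encoder cosine: {bi_score:.3f}, cross-encoder score: {cross_score:.3f}")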

The thesis of this article is direct: a RAG pipeline without a reranker is not a production RAG pipeline - it is a demo that has not yet encountered the query that will break it. In Part 1 of this series, we showed that adding a reranker to a hybrid retrieval pipeline reduces the retrieval failure rate from 5.7% to 1.9% - a 67% reduction, verified against Anthropic's Contextual Retrieval benchmarks. That single number should be the business case for every engineering team that has not yet added one.


The Precision Gap: Why Recall Is Not Enough

The Precision Gap is the difference between what your bi-encoder retrieves at position 1 and what actually belongs at position 1. It exists because bi-encoders encode queries and documents independently and then compute vector similarity - a fast approximation of relevance that holds up across millions of candidates but degrades on close-but-wrong results.

The failure modes are specific and consistent:

Negations. "What policies do NOT apply to contractors in regulated projects?" Your bi-encoder finds documents about contractor policies in regulated projects. The NOT is not well-represented in embedding space. The most similar documents are the ones that discuss what does apply, not what does not. A cross-encoder sees the query and document together, token by token, and detects the negation.

Temporal constraints. "What were the Q1 2025 projections before the policy change in March?" Dense embeddings place Q1 2025 documents and Q2 2025 documents close together - they are about the same fiscal year, same company, similar language. The temporal boundary is a retrieval concern the vector space cannot reliably encode. A cross-encoder scores the pair jointly and detects that the retrieved document postdates the constraint in the query.

Conditional clauses. "What happens to the SLA if the incident is classified as P1 AND affects more than 500 users?" Two conditions, both necessary. A bi-encoder finds documents about P1 incidents. It also finds documents about SLAs for large user bases. It does not reliably distinguish between documents that satisfy both conditions and documents that satisfy one. A cross-encoder does.

Domain-specific terminology that embeds ambiguously. As we established in Part 3, Semantic Compression Loss distorts the embedding space for specialized domains. Even well-formed chunks encoded by an appropriately chosen model will have residual compression artifacts. The reranker, operating on raw token interaction between query and document, partially compensates - it does not need to rely on the geometry of the embedding space because it reads the text directly.

None of these failure modes are visible in aggregate recall metrics. Your Recall@50 is fine - the right document is somewhere in the top 50. The Precision@5 that the LLM actually sees is broken.


Named Concept: The Precision Gap

The Precision Gap is the quality delta between your first-stage retriever's top-k and the true top-k for the query. It is measured as the difference in nDCG@10 between your retrieval output and a reranked output on the same candidate set. A Precision Gap of 0 means your bi-encoder is ranking correctly - an almost impossible condition on real corpora. A Precision Gap above 0.05 nDCG@10 means your LLM is routinely receiving suboptimal context, and some fraction of your answers are wrong because of it.
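
Measuring it does not require a framework. A minimal sketch with binary relevance labels (graded labels would use the standard gain formula instead of 0/1 hits) looks like this - the document ids and relevance judgments are whatever your eval set uses:

code
import math

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """nDCG@k with binary relevance: a hit at rank i contributes 1 / log2(i + 2)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def precision_gap(first_stage_ids: list[str], reranked_ids: list[str],
                  relevant_ids: set[str]) -> float:
    """The Precision Gap for one query: nDCG@10 after reranking minus nDCG@10 before,
    computed over the same candidate set. Average this across your eval queries."""
    return ndcg_at_k(reranked_ids, relevant_ids) - ndcg_at_k(first_stage_ids, relevant_ids)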

The Precision Gap is structural, not incidental. It exists because bi-encoders trade precision for speed. You cannot close it by tuning your embedding model, changing your chunk size, or increasing your candidate set. You can only close it by running a second-stage model that scores query-document pairs jointly.

The corollary: every pipeline component upstream of the reranker - retrieval strategy (Part 1), chunking (Part 2), embedding model (Part 3) - reduces the Precision Gap by improving the quality of candidates handed to the reranker. But none of them eliminate it. The Retrieval Tax from Part 1 has a precision component that only the reranker can collect.


The Wrong Way: Trusting Bi-Encoder Ranking at Top-K

code
# Wrong way: returning bi-encoder top-k directly to the LLM
# This is the pattern in most RAG tutorials and starter templates
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

def retrieve_and_answer(query: str, llm) -> str:
    # Stage 1: retrieve top-5 by cosine similarity
    # The bi-encoder ranked these by vector distance.
    # Documents at positions 3-5 may be close-but-wrong:
    # same topic, wrong time period, negated clause, wrong condition.
    docs = vectorstore.similarity_search(query, k=5)

    context = "\n\n".join([d.page_content for d in docs])

    # The LLM receives whatever the bi-encoder ranked highest.
    # If the correct document is at position 6, this answer is wrong.
    # The LLM will not signal that - it will answer confidently
    # from whatever context it receives.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.invoke(prompt).content

# What this produces:
# Query: "What SLA applies when a P1 incident affects more than 500 users?"
# Retrieved: [P1 SLA doc (general), P1 SLA doc (small teams),
#             500-user threshold doc (unrelated product),
#             SLA escalation doc, SLA reporting doc]
# Missing: the P1 + 500-user combined SLA that was at position 6
# Result: confident answer citing the wrong SLA tier

The Right Way: Two-Stage Retrieve-Then-Rerank

code
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from sentence_transformers import CrossEncoder

# Stage 1 retriever: wide net, optimized for recall
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

# Stage 2 reranker: cross-encoder, optimized for precision
# ms-marco-MiniLM-L-6-v2: nDCG@10 ~0.662, p95 ~55ms for 30 candidates
# Correct default for sub-500ms SLA budgets
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(
    query: str,
    retrieve_k: int = 50,
    rerank_top_n: int = 5,
) -> list[str]:
    """
    Two-stage retrieval: broad recall then precise reranking.
    Returns top-n document texts after reranking.

    retrieve_k: how many candidates to pull from first stage (recall budget)
    rerank_top_n: how many to return after reranking (precision budget)

    Latency profile with MiniLM cross-encoder:
      - retrieve_k=50: ~100-150ms on CPU, ~30ms on GPU
      - retrieve_k=100: ~200-300ms on CPU, ~60ms on GPU

    Rule: if your total latency budget is 500ms,
    spend 50-100ms on retrieval and 150-200ms on reranking.
    Reranking top-50 returns higher ROI than retrieving top-500.
    """
    # Stage 1: fast recall via bi-encoder similarity
    docs = vectorstore.similarity_search(query, k=retrieve_k)

    if not docs:
        return []

    # Stage 2: precise reranking via cross-encoder
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)

    # Sort by cross-encoder score descending, return texts only
    ranked = sorted(
        zip(scores, docs),
        key=lambda x: x[0],
        reverse=True,
    )

    return [doc.page_content for _, doc in ranked[:rerank_top_n]]


def retrieve_rerank_and_answer(query: str, llm) -> str:
    """
    Full two-stage pipeline with reranking before generation.
    """
    reranked_texts = retrieve_and_rerank(query, retrieve_k=50, rerank_top_n=5)

    if not reranked_texts:
        return "No relevant documents found."

    context = "\n\n".join(reranked_texts)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.invoke(prompt).content

Choosing Your Reranker: The Three-Way Split

The reranker decision is a latency-cost-quality triangle. There is no universal winner - the right choice depends on your SLA, your infrastructure constraints, and your domain.

Cross-Encoders: The Precision Standard

Cross-encoders process the query and document as a single input through the transformer. Every query token attends to every document token. This is full interaction - the most accurate scoring mechanism available.

The cost is that you must run one forward pass per candidate. At 50ms per document on CPU, reranking 100 candidates takes 5 seconds. That is the boundary condition: cross-encoders are practical only on small candidate sets (typically 30-100), which is exactly the right candidate set size for second-stage reranking after first-stage retrieval.

Model selection guide (September 2025 benchmark):

| Model | nDCG@10 | p95 Latency | Cost / 1K queries | When to use |
|---|---|---|---|---|
| Cohere Rerank v4 Pro | 0.735+ | ~210ms (API) | ~$1-2 | Managed API, no GPU, quality priority |
| bge-reranker-large-v2 | 0.715 | 145ms (CPU) | ~$0.35 (self-host) | Best open-weight quality per cost |
| bge-reranker-base-v2 | 0.699 | 92ms (CPU) | ~$0.18 (self-host) | 90% of quality at half the cost |
| MiniLM-L-6-v2 (ms-marco) | 0.662 | 55ms (CPU) | ~$0.08 (self-host) | Sub-500ms SLA, high-throughput baseline |
| Jina-reranker-v2-multilingual | 0.694 | 110ms (CPU) | ~$0.30 (self-host) | Multilingual corpora |

Use MiniLM as the default for synchronous RAG with sub-500ms SLA budgets. Use bge-reranker-large-v2 when you have GPU access and quality matters more than cost. Use Cohere Rerank when you want a managed API and do not want to operate GPU infrastructure.
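
To make that choice empirical rather than reputational, run each candidate reranker over the same labeled eval set and compare mean nDCG@10. A minimal sketch, reusing the ndcg_at_k helper from the Precision Gap section above; the eval-set structure and the second model id are assumptions, substitute your own:

code
from sentence_transformers import CrossEncoder

def mean_ndcg(model_name: str, eval_set: list[dict], k: int = 10) -> float:
    """
    Hypothetical eval_set item structure:
      {"query": str,
       "candidates": [(doc_id, text), ...],   # first-stage candidates
       "relevant_ids": set of doc_ids judged relevant}
    """
    model = CrossEncoder(model_name)
    total = 0.0
    for item in eval_set:
        pairs = [(item["query"], text) for _, text in item["candidates"]]
        scores = model.predict(pairs)
        ranked = sorted(zip(scores, item["candidates"]), key=lambda x: x[0], reverse=True)
        ranked_ids = [doc_id for _, (doc_id, _) in ranked]
        total += ndcg_at_k(ranked_ids, item["relevant_ids"], k)
    return total / len(eval_set)

# baseline = mean_ndcg("cross-encoder/ms-marco-MiniLM-L-6-v2", eval_set)
# stronger = mean_ndcg("BAAI/bge-reranker-large", eval_set)  # exact id per your model registry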

ColBERT: The Throughput Alternative

ColBERT is late interaction retrieval: queries and documents are encoded separately into token-level embeddings, and scoring happens via the MaxSim operation - for each query token, find the maximum similarity across all document tokens, then sum. Document representations are precomputed at index time. Only the MaxSim operation runs at query time.
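
A minimal sketch of the MaxSim operation itself, assuming L2-normalized token embeddings are already in hand (real ColBERT adds query augmentation, per-token dimensionality reduction, and index compression on top of this):

code
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """
    ColBERT late-interaction scoring.
    query_tokens: (num_query_tokens, dim), L2-normalized
    doc_tokens:   (num_doc_tokens, dim), L2-normalized, precomputed at index time
    """
    sim = query_tokens @ doc_tokens.T        # cosine similarity of every token pair
    return float(sim.max(axis=1).sum())      # best doc token per query token, summed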

The result: near cross-encoder accuracy at a fraction of the latency. A Towards Data Science benchmark (April 2026) measured ColBERT p50 latency at 23ms at 40 QPS, while a full cross-encoder on the same load hit p99.9 over 21 seconds. ColBERT is the architecture for high-throughput pipelines where tail latency matters more than absolute precision.

The tradeoff: ColBERT requires per-token storage for every document in the corpus. A 1M-document corpus with 256-token average chunks requires substantial index storage. For retrieval-only use cases at scale, this is workable. For most RAG pipelines serving interactive queries at moderate QPS, a batched cross-encoder is simpler and close enough in accuracy.
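
The back-of-envelope arithmetic, assuming ColBERT-style 128-dimensional token vectors stored in fp16 with no compression (ColBERTv2's residual compression shrinks this considerably, so treat it as an upper bound):

code
docs = 1_000_000
tokens_per_doc = 256
dim = 128
bytes_per_value = 2  # fp16

index_bytes = docs * tokens_per_doc * dim * bytes_per_value
print(f"{index_bytes / 1e9:.0f} GB")  # ~66 GB of token embeddings before compression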

LLM-Based Reranking: When to Use It and When Not To

LLM-based reranking sends the query and each candidate document to a language model with a relevance scoring prompt. The LLM outputs a relevance judgment. This approach can be 5-8% more accurate than cross-encoders on listwise ranking tasks (ZeroEntropy, 2025).

It also adds 4-6 seconds of latency and costs an order of magnitude more per query. The use cases are narrow: offline batch processing, high-stakes retrieval where accuracy outweighs latency (legal discovery, compliance review), and queries where you need the reranker to explain its relevance judgments for audit purposes.
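
For those offline and audit-heavy cases, a minimal pointwise sketch looks like the following. The llm.invoke interface matches the earlier examples; the prompt wording and the 0-10 scale are illustrative, and production listwise variants batch multiple candidates per call:

code
def llm_rerank(query: str, candidates: list[str], llm, top_n: int = 5) -> list[str]:
    """Pointwise LLM reranking: one relevance judgment per candidate."""
    scored = []
    for doc in candidates:
        prompt = (
            "Rate how relevant the document is to the query on a scale of 0-10.\n"
            f"Query: {query}\n"
            f"Document: {doc}\n"
            "Respond with a single integer."
        )
        raw = llm.invoke(prompt).content.strip()
        try:
            score = int(raw)
        except ValueError:
            score = 0  # unparseable judgment counts as not relevant
        scored.append((score, doc))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

One LLM call per candidate is where the latency and cost figures above come from.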

For synchronous, interactive RAG: do not use LLM-based reranking. The latency is incompatible with user-facing SLAs. The accuracy gain does not justify the cost in conversational contexts.


Production Reranking Patterns

Pattern 1: Selective Reranking (The Cost-Control Gate)

Not every query needs reranking. If the first-stage retriever returns a candidate with very high similarity, it probably ranked correctly. The Precision Gap is smallest when the top candidate is a near-perfect match.

code
from sentence_transformers import CrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Threshold calibrated on your eval set.
# A first-stage score above this means the top candidate
# is likely correct - skip reranking to save latency.
RERANK_THRESHOLD = 0.85

def selective_rerank(
    query: str,
    retrieve_k: int = 50,
    rerank_top_n: int = 5,
) -> list[str]:
    """
    Only rerank when first-stage confidence is below threshold.
    This reduces average latency without sacrificing precision on hard queries.

    Tune RERANK_THRESHOLD on your eval set:
    - Too high: you skip reranking when you should not
    - Too low: you rerank everything (no cost savings)
    - Target: skip reranking on ~20-30% of queries
    """
    # Retrieve with normalized relevance scores in [0, 1], higher = more similar.
    # (similarity_search_with_score returns raw distances where lower is better,
    # which silently inverts this threshold logic - avoid it here.)
    results = vectorstore.similarity_search_with_relevance_scores(query, k=retrieve_k)

    if not results:
        return []

    docs, scores = zip(*results)
    top_score = scores[0]  # relevance of the best first-stage candidate

    # High-confidence: top candidate is likely correct, skip reranking
    if top_score >= RERANK_THRESHOLD:
        return [doc.page_content for doc in docs[:rerank_top_n]]

    # Low-confidence: first-stage ranking is uncertain, apply reranker
    pairs = [(query, doc.page_content) for doc in docs]
    rerank_scores = reranker.predict(pairs)

    ranked = sorted(
        zip(rerank_scores, docs),
        key=lambda x: x[0],
        reverse=True,
    )
    return [doc.page_content for _, doc in ranked[:rerank_top_n]]

Pattern 2: Cascaded Reranking (Three-Stage for Tight SLAs)

When you have a large candidate set and a tight latency budget, run two reranking stages: a fast lightweight reranker to trim candidates, then an accurate heavy reranker on the trimmed set.

code
from sentence_transformers import CrossEncoder

# Stage 1 reranker: fast, lightweight, good enough for coarse filtering
fast_reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

# Stage 2 reranker: accurate, for final precision
accurate_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cascaded_rerank(
    query: str,
    candidates: list[str],  # document texts from first-stage retrieval
    fast_top_n: int = 20,
    final_top_n: int = 5,
) -> list[str]:
    """
    Two-stage reranking cascade over pre-retrieved candidate texts.
    Call after first-stage retrieval:
      candidates = [d.page_content for d in vectorstore.similarity_search(query, k=100)]

    Stage 1: TinyBERT trims 50-100 candidates to 20.
             TinyBERT: ~2x faster than MiniLM, ~10% quality loss.
             Acceptable for coarse filtering, not for final ranking.

    Stage 2: MiniLM produces final precise ranking over the top-20 only.

    Total latency: comparable to running MiniLM on top-20 directly,
    but starting from 50-100 candidates rather than pre-selected ones.
    This matters when you cannot trust first-stage ordering at all.
    """
    if not candidates:
        return []

    # Stage 1: fast coarse filter
    fast_pairs = [(query, doc) for doc in candidates]
    fast_scores = fast_reranker.predict(fast_pairs)

    fast_ranked = sorted(
        zip(fast_scores, candidates),
        key=lambda x: x[0],
        reverse=True,
    )
    top_candidates = [doc for _, doc in fast_ranked[:fast_top_n]]

    # Stage 2: accurate final ranking
    accurate_pairs = [(query, doc) for doc in top_candidates]
    accurate_scores = accurate_reranker.predict(accurate_pairs)

    final_ranked = sorted(
        zip(accurate_scores, top_candidates),
        key=lambda x: x[0],
        reverse=True,
    )
    return [doc for _, doc in final_ranked[:final_top_n]]

Pattern 3: Score Threshold Filtering (Relevance Floor)

Cross-encoder scores carry absolute meaning, not just rank order. On the ms-marco models, the raw logit separates relevant from irrelevant pairs: a score below a threshold signals that even the best candidate is not relevant to this query - you should not pass it to the LLM at all.

code
import torch
from sentence_transformers import CrossEncoder

reranker = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2",
    default_activation_function=torch.nn.Identity(),  # raw logits, no sigmoid
)

# Minimum relevance score to pass a document to the LLM.
# Below this: the document is not relevant enough to include.
# Tune on your eval set; 0.0 is a reasonable starting point for ms-marco models
# because relevant pairs tend to score above 0 and irrelevant pairs below it.
RELEVANCE_FLOOR = 0.0

def rerank_with_floor(
    query: str,
    candidates: list[str],
    top_n: int = 5,
) -> list[str]:
    """
    Rerank and filter by relevance floor.
    If no documents clear the floor, return empty rather than passing
    irrelevant context to the LLM and triggering a confident wrong answer.

    Returning empty signals "I do not have this information" cleanly.
    Returning below-floor documents signals "here is something close"
    and the LLM will confabulate an answer from it.
    """
    if not candidates:
        return []

    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(scores, candidates),
        key=lambda x: x[0],
        reverse=True,
    )

    # Filter to documents above the relevance floor
    relevant = [
        doc for score, doc in ranked[:top_n]
        if score >= RELEVANCE_FLOOR
    ]

    return relevant  # Empty if nothing clears the floor - intentional

The Complete Four-Stage Pipeline

After four parts, the pipeline is fully assembled. Each layer addresses a specific failure mode that the layers above cannot compensate for:

mermaid
flowchart TD
    Q[User query] --> RT{Route by\nquery type\nPart 1}
    RT -- Structured data --> SQL[SQL RAG]
    RT -- Relational query --> GR[Graph RAG]
    RT -- Semantic query --> H1[Hybrid retrieval\nBM25 + Dense + RRF\nPart 1]
    H1 --> CS[Retrieve k=50-100\ncandidates]
    CS --> CK{Chunking\nstrategy\nPart 2}
    CK --> EM[Embedding model\naligned to domain\nPart 3]
    EM --> RR[Cross-encoder\nreranker\nPart 4]
    RR --> FL{Score above\nrelevance floor?}
    FL -- Yes --> LLM[LLM generation\nwith top-5 context]
    FL -- No --> NA[No relevant\ncontent found]
    SQL --> LLM
    GR --> LLM

    style Q fill:#4A90E2,color:#fff
    style RT fill:#7B68EE,color:#fff
    style SQL fill:#98D8C8,color:#333
    style GR fill:#98D8C8,color:#333
    style H1 fill:#4A90E2,color:#fff
    style CS fill:#4A90E2,color:#fff
    style CK fill:#9B59B6,color:#fff
    style EM fill:#9B59B6,color:#fff
    style RR fill:#6BCF7F,color:#fff
    style FL fill:#7B68EE,color:#fff
    style LLM fill:#6BCF7F,color:#fff
    style NA fill:#E74C3C,color:#fff

The diagram makes the dependency chain explicit. Retrieval strategy (Part 1) determines which backend runs. Chunking (Part 2) determines what enters the index. Embedding model (Part 3) determines how it is encoded. Reranking (Part 4 - this article) determines what the LLM finally sees.

Each upstream layer reduces the work the downstream layer has to do. Good chunking reduces the Precision Gap that the reranker has to close. Domain-aligned embeddings reduce the Semantic Compression Loss that the reranker partially compensates for. But none of them replace the reranker's function: joint query-document scoring by a model that reads both simultaneously.


The Reranking Deployment Checklist

Latency budget:

  • Total SLA defined (e.g. p95 <500ms end-to-end)
  • Retrieval budget set (50-100ms for first stage)
  • Reranking budget set (remainder, typically 150-300ms)
  • Candidate set size chosen to fit reranking budget: 30 candidates on CPU MiniLM = ~100-150ms; 50 candidates on GPU = ~30-50ms
  • If budget exceeded: consider cascaded reranking or ColBERT for high-QPS paths

Model selection:

  • Evaluated at least MiniLM baseline and one stronger model (bge-reranker-large-v2 or Cohere) against your eval set
  • Checked multilingual support if corpus is non-English
  • Confirmed licensing for self-hosted models (verify each model card: the bge and ms-marco MiniLM rerankers are permissively licensed, some jina reranker releases are not licensed for commercial use, and Cohere is a commercial API)
  • Decided between managed API (Cohere) and self-hosted (bge, MiniLM) based on infrastructure constraints

Operational patterns:

  • Selective reranking threshold calibrated on eval set (target: skip reranking on ~20-30% of high-confidence queries)
  • Relevance floor threshold set and tested (prevents confident wrong answers on out-of-scope queries)
  • Reranking scores logged for monitoring: Precision Gap visible in production metrics
  • Fallback behavior defined: what happens if reranker is unavailable (circuit breaker to first-stage results, not hard failure)
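
The last bullet is worth a sketch: a fallback wrapper that degrades to first-stage ordering when the reranker fails, assuming the reranker object and pre-retrieved candidate texts from the earlier examples:

code
def rerank_with_fallback(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Circuit-breaker behavior: serve first-stage order rather than failing the request."""
    try:
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        return [doc for _, doc in ranked[:top_n]]
    except Exception:
        # Reranker unavailable or timed out: log it, return first-stage ordering.
        return candidates[:top_n]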

Evaluation:

  • Precision@5 measured before and after reranking on representative query set
  • nDCG@10 computed for both stages: the delta is your Precision Gap
  • Answer faithfulness measured end-to-end: does reranking improve final answer quality, not just retrieval ranking?

Where the Pipeline Stands

Four parts in, the core RAG pipeline is fully specified and the named concepts form a coherent diagnostic framework:

  • The Retrieval Tax (Part 1): the compounding cost of wrong retrieval strategy per query
  • Chunking Debt (Part 2): accumulated quality degradation from early chunking decisions
  • Semantic Compression Loss (Part 3): information destroyed by domain-misaligned embeddings
  • The Precision Gap (Part 4 - this article): the quality delta between bi-encoder top-k and the true top-k

Each concept names a specific, measurable failure mode at a specific layer of the pipeline. Each has a specific fix. Together they give a production RAG system its full diagnostic vocabulary.

Part 5 covers what happens when you do not know whether any of this is working: RAG evaluation, the evals blind spot, and why most teams are flying blind on retrieval quality until a production incident forces the question.

