
Reranking for RAG: Boosting Answer Quality in Retrieval-Augmented Generation

#genai #llm #python #rag #reranking #retrieval-augmented-generation

Retrieval-Augmented Generation (RAG) is one of the most effective techniques for making large language models (LLMs) answer accurately using external knowledge.
The idea is straightforward:

  1. Retrieve relevant documents from your knowledge base.

  2. Augment your LLM prompt with those documents.

  3. Generate an answer using the LLM.

Sounds simple, right? The problem is:

Even the best vector search algorithms sometimes return documents that are only loosely related to the query — or miss subtle but highly relevant matches.

This is where Reranking enters the scene — the “quality filter” for your retrieved documents.

What is Reranking in RAG?

Reranking is a second-stage filtering process that reorders retrieved documents by actual relevance to the user query, often using a more sophisticated model than the one used for the initial retrieval.

Think of it as precision tuning:

  • Stage 1 (vector retrieval) → Fast and broad: retrieve 30–100 potentially relevant docs.

  • Stage 2 (reranking) → Slow but sharp: deeply score these docs for true relevance.

This two-stage approach mirrors real-world search engines like Google, which first retrieve a broad set of results (recall-focused) and then apply a more precise ranking model (precision-focused).

This is especially important because standard retrieval models (like BM25 or dense embeddings) prioritize speed over deep contextual matching. Reranking uses more expensive models (like cross-encoders) that read the query and each document together, yielding higher precision.
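To see how the two stages fit together, here is a minimal sketch of the full pipeline. The retrieve() and generate() functions are hypothetical placeholders for your vector store search and your LLM call; only the reranking stage uses a concrete model (the same sentence-transformers cross-encoder used in the examples below).

code
from sentence_transformers import CrossEncoder

# Stage 2 model: scores each (query, document) pair jointly
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_n=5):
    """Re-score a broad candidate set and keep only the best top_n."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

def answer(query):
    candidates = retrieve(query, top_k=50)   # Stage 1: fast, recall-focused (placeholder)
    context = rerank(query, candidates)      # Stage 2: slow, precision-focused
    return generate(query, context)          # Placeholder for your LLM call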

Why Reranking Matters in RAG

Without reranking, your RAG system might answer from a less relevant document simply because the retriever’s default scoring ranked it higher.

Example:
Imagine a customer of the State Bank of India (SBI) asks:
"What is the minimum balance required for an SBI savings account in a metro city?"

Without Reranking:

  • Retriever might pull in documents about fixed deposit interest rates, ATM withdrawal limits, and minimum balance rules for rural branches.

  • The first retrieved document might mention “minimum balance” but for rural accounts, not metro city accounts.

With Reranking:

  • The reranker analyzes the exact query and re-scores documents so that the top-ranked one specifically contains:
    • Metro city rules

    • SBI’s updated minimum balance criteria

    • Correct fee details if balance is below the limit

This ensures the generator receives the right context and produces a correct answer.

Common Reranking Techniques

Here are the most common approaches used in production RAG systems:

1. Cross-Encoder Models

  • Takes the query and document together as input.

  • Outputs a single relevance score.

  • Pros: Very accurate.

  • Cons: Slower, since every (query, document) pair must be scored at query time; unlike bi-encoder embeddings, document representations cannot be precomputed.

Python Example
code
from sentence_transformers import CrossEncoder

# Load a cross-encoder model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Example query
query = "What is the minimum balance required for an SBI savings account in a metro city?"

# Retrieved documents
documents = [
    "SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.",
    "SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.",
    "In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000."
]

# Prepare (query, document) pairs for scoring
pairs = [(query, doc) for doc in documents]

# Score each pair for relevance
scores = model.predict(pairs)

# Sort documents by score (descending)
reranked_docs = [doc for _, doc in sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)]

print("Reranked Documents:")
for doc in reranked_docs:
    print(doc)

Sample Output:

code
Reranked Documents:
SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.
In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000.
SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.

2. Bi-Encoder + Cross-Encoder Hybrid

  • First, a fast bi-encoder retrieves candidates.

  • Then, a cross-encoder reranks the top results.

  • Best of both worlds — speed and accuracy.

Python Example
code
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Step 1: Create SBI corpus
corpus = [
    "The minimum balance required for SBI savings account is ₹1000 in metro cities.",
    "SBI provides 7.5% interest rate for senior citizen fixed deposits.",
    "You can link your Aadhaar to your SBI account through the YONO app.",
    "SBI charges ₹20 per transaction for ATM withdrawals beyond the free limit.",
    "The SBI home loan interest rate starts from 8.5% per annum.",
    "SBI credit cards offer reward points on every transaction."
]

# Step 2: Load Bi-Encoder and Cross-Encoder
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')         # For retrieval
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # For reranking

# Step 3: Encode corpus for Bi-Encoder retrieval
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

# Step 4: User query
query = "What is the interest rate for senior citizen FD in SBI?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Step 5: Retrieve top N candidates using Bi-Encoder
top_k = 3
bi_encoder_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

# Step 6: Prepare for Cross-Encoder reranking
cross_inp = [(query, corpus[hit['corpus_id']]) for hit in bi_encoder_hits]
cross_scores = cross_encoder.predict(cross_inp)

# Step 7: Combine results and sort by Cross-Encoder score
reranked_results = sorted(
    zip(cross_inp, cross_scores),
    key=lambda x: x[1],
    reverse=True
)

# Step 8: Print results
print(f"Query: {query}\n")
print("Top Results after Reranking:")
for (q, passage), score in reranked_results:
    print(f"Score: {score:.4f} | {passage}")

Sample Output:

code
Query: What is the interest rate for senior citizen FD in SBI?

Top Results after Reranking:
Score: 8.5123 | SBI provides 7.5% interest rate for senior citizen fixed deposits.
Score: 5.9012 | The SBI home loan interest rate starts from 8.5% per annum.
Score: 3.2710 | SBI credit cards offer reward points on every transaction.

3. LLM-based Reranking

  • Uses large language models (e.g., GPT, LLaMA) to rate document relevance.

  • Can understand nuanced and multi-step queries.

  • Higher cost, but sometimes worth it for complex domains.

Python Example
code
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. SBI Corpus
corpus = [
    "The minimum balance required for SBI savings account in metro cities is ₹3000.",
    "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh.",
    "SBI home loan interest rate starts from 8.5% per annum.",
    "SBI fixed deposit for senior citizens offers 7.5% per annum interest."
]

# 2. Simulated Retrieval Output
retrieved_docs = [
    corpus[1],  # savings account interest
    corpus[3],  # senior citizen FD
    corpus[0]   # minimum balance
]

query = "What interest rate does SBI offer for fixed deposits for senior citizens?"

# 3. Load Phi-3-Mini-Instruct model from Hugging Face
#    (supports chat-style prompts with system, user, and assistant roles)
model_name = "microsoft/phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# 4. Build prompt for reranking (the ranking instruction belongs in the user turn)
prompt = "<|system|>You are an assistant that ranks documents by relevance.<|end|>\n"
prompt += f"<|user|>Query: {query}\nDocuments:\n"
for idx, doc in enumerate(retrieved_docs):
    prompt += f"{idx}: {doc}\n"
prompt += "Provide the ranking as a list of indexes [most relevant first], plus a brief explanation.<|end|>\n"
prompt += "<|assistant|>"

# 5. Tokenize and generate deterministically
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False  # greedy decoding; temperature is ignored when sampling is off
)

# Decode only the newly generated tokens, not the prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)

print("=== Reranking Response ===")
print(response)

Sample Output:

code
=== Reranking Response ===
[1, 2, 0]
The most relevant document is index 1: "SBI fixed deposit for senior citizens offers 7.5% per annum interest." It directly answers the query about FD interest for senior citizens. Next is index 2: "The minimum balance required for SBI savings account in metro cities is ₹3000." While not about fixed deposits, it mentions account-related terms. Index 0: "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh." This is least relevant because it talks about savings account rates, not fixed deposit rates.

Best Practices for Reranking in RAG

  1. Limit the candidate pool — Avoid reranking all retrieved results; rerank only the top N (e.g., 50).

  2. Use domain-specific fine-tuning — Fine-tune reranker models on your domain data for better accuracy.

  3. Cache results — For frequent queries, store reranked results to save computation (a minimal caching sketch follows this list).

  4. Balance speed vs accuracy — In real-time applications, choose models that meet your latency requirements.

  5. Continuously evaluate — Track metrics like MRR (Mean Reciprocal Rank) and nDCG to measure impact (see the metric sketch after this list).
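As a sketch of practice 3, here is a minimal in-memory cache around the reranking step. A production system would more likely use an external store such as Redis with an expiry policy; the cache structure here is purely illustrative.

code
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
_rerank_cache = {}  # (query, candidates) -> reranked top-N documents

def rerank_cached(query, candidates, top_n=5):
    # Key on both query and candidates, since retrieval results can change
    key = (query, tuple(candidates))
    if key not in _rerank_cache:
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        _rerank_cache[key] = [doc for _, doc in ranked[:top_n]]
    return _rerank_cache[key]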
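And to make practice 5 concrete, here is a small, self-contained sketch of both metrics. The document IDs and relevance judgments are invented for illustration; in a real evaluation they would come from a labeled test set.

code
import math

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant document."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, relevance, k=10):
    """nDCG@k with graded relevance given as a dict: doc -> gain."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy check: a reranker that puts the relevant doc first scores perfectly
print(mrr([["doc_fd", "doc_atm"]], [{"doc_fd"}]))            # 1.0
print(ndcg_at_k(["doc_fd", "doc_atm"], {"doc_fd": 3}, k=2))  # 1.0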

Conclusion

Reranking acts as a precision filter for RAG pipelines. By ensuring that the right documents make it to the generation stage, you can drastically reduce irrelevant or partially correct answers.

For any production-grade RAG system — whether it’s for banking FAQs, legal document search, or technical support — reranking can be the key differentiator in delivering high-quality, trustworthy AI answers.