
Reranking for RAG: Boosting Answer Quality in Retrieval-Augmented Generation

#genai #llm #python #rag #reranking #retrieval-augmented-generation

Retrieval-Augmented Generation (RAG) is one of the most effective techniques for making large language models (LLMs) answer accurately using external knowledge.
The idea is straightforward:

  1. Retrieve relevant documents from your knowledge base.

  2. Augment your LLM prompt with those documents.

  3. Generate an answer using the LLM.

Sounds simple, right? The problem is:

Even the best vector search algorithms sometimes return documents that are only loosely related to the query — or miss subtle but highly relevant matches.

This is where Reranking enters the scene — the “quality filter” for your retrieved documents.

What is Reranking in RAG?

Reranking is a second-stage filtering process that reorders retrieved documents by actual relevance to the user query, often using a more sophisticated model than the one used for the initial retrieval.

Think of it as precision tuning:

  • Stage 1 (vector retrieval) → Fast and broad: retrieve 30–100 potentially relevant docs.

  • Stage 2 (reranking) → Slow but sharp: deeply score these docs for true relevance.

This two-stage approach mirrors real-world search engines like Google, which first retrieve a broad set of results (recall-focused) and then apply a more precise ranking model (precision-focused).

This is especially important because standard retrieval models (like BM25 or dense embeddings) prioritize speed over deep contextual matching. Reranking uses more expensive models (like cross-encoders) that read the query and each document together, yielding higher precision.
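To see how the two stages fit together, here is a minimal sketch of the full pipeline. The retrieve() and generate() functions are hypothetical placeholders for your vector store search and your LLM call; only the reranking stage uses a concrete model (the same sentence-transformers cross-encoder used in the examples below).

code
from sentence_transformers import CrossEncoder

# Stage 2 model: scores each (query, document) pair jointly
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_n=5):
    """Re-score a broad candidate set and keep only the best top_n."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

def answer(query):
    candidates = retrieve(query, top_k=50)   # Stage 1: fast, recall-focused (placeholder)
    context = rerank(query, candidates)      # Stage 2: slow, precision-focused
    return generate(query, context)          # Placeholder for your LLM call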

Why Reranking Matters in RAG

Without reranking, your RAG system might answer from a less relevant document simply because the retriever’s default scoring ranked it higher.

Example:
Imagine a customer of the State Bank of India (SBI) asks:
"What is the minimum balance required for an SBI savings account in a metro city?"

Without Reranking:

  • Retriever might pull in documents about fixed deposit interest rates, ATM withdrawal limits, and minimum balance rules for rural branches.

  • The first retrieved document might mention “minimum balance” but for rural accounts, not metro city accounts.

With Reranking:

  • The reranker analyzes the exact query and re-scores documents so that the top-ranked one specifically contains:
    • Metro city rules

    • SBI’s updated minimum balance criteria

    • Correct fee details if balance is below the limit

This ensures the generator receives the right context and produces a correct answer.

Common Reranking Techniques

Here are the most common approaches used in production RAG systems:

1. Cross-Encoder Models

  • Takes the query and document together as input.

  • Outputs a single relevance score.

  • Pros: Very accurate.

  • Cons: Slower, since every (query, document) pair must be scored at query time; unlike bi-encoder embeddings, document representations cannot be precomputed.

Python Example
code
from sentence_transformers import CrossEncoder

# Load a cross-encoder model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Example query
query = "What is the minimum balance required for an SBI savings account in a metro city?"

# Retrieved documents
documents = [
    "SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.",
    "SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.",
    "In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000."
]

# Prepare (query, document) pairs for scoring
pairs = [(query, doc) for doc in documents]

# Score each pair for relevance
scores = model.predict(pairs)

# Sort documents by score (descending)
reranked_docs = [doc for _, doc in sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)]

print("Reranked Documents:")
for doc in reranked_docs:
    print(doc)

Sample Output:

code
Reranked Documents:
SBI savings account in metro cities requires a minimum balance of Rs. 3,000 to avoid penalties.
In rural areas, SBI savings accounts require a minimum balance of Rs. 1,000.
SBI fixed deposit interest rates vary between 3% and 6% depending on tenure.

2. Bi-Encoder + Cross-Encoder Hybrid

  • First, a fast bi-encoder retrieves candidates.

  • Then, a cross-encoder reranks the top results.

  • Best of both worlds — speed and accuracy.

Python Example
code
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Step 1: Create SBI corpus
corpus = [
    "The minimum balance required for SBI savings account is ₹1000 in metro cities.",
    "SBI provides 7.5% interest rate for senior citizen fixed deposits.",
    "You can link your Aadhaar to your SBI account through the YONO app.",
    "SBI charges ₹20 per transaction for ATM withdrawals beyond the free limit.",
    "The SBI home loan interest rate starts from 8.5% per annum.",
    "SBI credit cards offer reward points on every transaction."
]

# Step 2: Load Bi-Encoder and Cross-Encoder
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')         # For retrieval
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # For reranking

# Step 3: Encode corpus for Bi-Encoder retrieval
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

# Step 4: User query
query = "What is the interest rate for senior citizen FD in SBI?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Step 5: Retrieve top N candidates using Bi-Encoder
top_k = 3
bi_encoder_hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

# Step 6: Prepare for Cross-Encoder reranking
cross_inp = [(query, corpus[hit['corpus_id']]) for hit in bi_encoder_hits]
cross_scores = cross_encoder.predict(cross_inp)

# Step 7: Combine results and sort by Cross-Encoder score
reranked_results = sorted(
    zip(cross_inp, cross_scores),
    key=lambda x: x[1],
    reverse=True
)

# Step 8: Print results
print(f"Query: {query}\n")
print("Top Results after Reranking:")
for (q, passage), score in reranked_results:
    print(f"Score: {score:.4f} | {passage}")

Sample Output:

code
Query: What is the interest rate for senior citizen FD in SBI?

Top Results after Reranking:
Score: 8.5123 | SBI provides 7.5% interest rate for senior citizen fixed deposits.
Score: 5.9012 | The SBI home loan interest rate starts from 8.5% per annum.
Score: 3.2710 | SBI credit cards offer reward points on every transaction.

3. LLM-based Reranking

  • Uses large language models (e.g., GPT, LLaMA) to rate document relevance.

  • Can understand nuanced and multi-step queries.

  • Higher cost, but sometimes worth it for complex domains.

Python Example
code
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. SBI Corpus
corpus = [
    "The minimum balance required for SBI savings account in metro cities is ₹3000.",
    "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh.",
    "SBI home loan interest rate starts from 8.5% per annum.",
    "SBI fixed deposit for senior citizens offers 7.5% per annum interest."
]

# 2. Simulated Retrieval Output
retrieved_docs = [
    corpus[1],  # savings account interest
    corpus[3],  # senior citizen FD
    corpus[0]   # minimum balance
]

query = "What interest rate does SBI offer for fixed deposits for senior citizens?"

# 3. Load Phi-3-Mini-Instruct model from Hugging Face
#    (supports chat-style prompts with system, user, and assistant roles)
model_name = "microsoft/phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

# 4. Build prompt for reranking (the ranking instruction belongs in the user turn)
prompt = "<|system|>You are an assistant that ranks documents by relevance.<|end|>\n"
prompt += f"<|user|>Query: {query}\nDocuments:\n"
for idx, doc in enumerate(retrieved_docs):
    prompt += f"{idx}: {doc}\n"
prompt += "Provide the ranking as a list of indexes [most relevant first], plus a brief explanation.<|end|>\n"
prompt += "<|assistant|>"

# 5. Tokenize and generate deterministically
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False  # greedy decoding; temperature is ignored when sampling is off
)

# Decode only the newly generated tokens, not the prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)

print("=== Reranking Response ===")
print(response)

Sample Output:

code
=== Reranking Response ===
[1, 2, 0]
The most relevant document is index 1: "SBI fixed deposit for senior citizens offers 7.5% per annum interest." It directly answers the query about FD interest for senior citizens. Next is index 2: "The minimum balance required for SBI savings account in metro cities is ₹3000." While not about fixed deposits, it mentions account-related terms. Index 0: "SBI offers a 3.5% interest rate for savings accounts up to ₹1 lakh." This is least relevant because it talks about savings account rates, not fixed deposit rates.

Best Practices for Reranking in RAG

  1. Limit the candidate pool — Avoid reranking all retrieved results; rerank only the top N (e.g., 50).

  2. Use domain-specific fine-tuning — Fine-tune reranker models on your domain data for better accuracy.

  3. Cache results — For frequent queries, store reranked results to save computation (a minimal caching sketch follows this list).

  4. Balance speed vs accuracy — In real-time applications, choose models that meet your latency requirements.

  5. Continuously evaluate — Track metrics like MRR (Mean Reciprocal Rank) and nDCG to measure impact (see the metric sketch after this list).
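As a sketch of practice 3, here is a minimal in-memory cache around the reranking step. A production system would more likely use an external store such as Redis with an expiry policy; the cache structure here is purely illustrative.

code
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
_rerank_cache = {}  # (query, candidates) -> reranked top-N documents

def rerank_cached(query, candidates, top_n=5):
    # Key on both query and candidates, since retrieval results can change
    key = (query, tuple(candidates))
    if key not in _rerank_cache:
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        _rerank_cache[key] = [doc for _, doc in ranked[:top_n]]
    return _rerank_cache[key]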
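And to make practice 5 concrete, here is a small, self-contained sketch of both metrics. The document IDs and relevance judgments are invented for illustration; in a real evaluation they would come from a labeled test set.

code
import math

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant document."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, relevance, k=10):
    """nDCG@k with graded relevance given as a dict: doc -> gain."""
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy check: a reranker that puts the relevant doc first scores perfectly
print(mrr([["doc_fd", "doc_atm"]], [{"doc_fd"}]))            # 1.0
print(ndcg_at_k(["doc_fd", "doc_atm"], {"doc_fd": 3}, k=2))  # 1.0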

Conclusion

Reranking acts as a precision filter for RAG pipelines. By ensuring that the right documents make it to the generation stage, you can drastically reduce irrelevant or partially correct answers.

For any production-grade RAG system — whether it’s for banking FAQs, legal document search, or technical support — reranking can be the key differentiator in delivering high-quality, trustworthy AI answers.