
BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production

genai, information-retrieval, rag-systems
#bm25-retrieval #dense-retrieval #rag-debugging #hybrid-search #semantic-search #vector-databases

You've built a RAG system. The embedding model is state-of-the-art. Your vector database is fast. The LLM is capable. Your demo works beautifully.

Then you deploy to production and watch it fail in ways that make no sense. The model hallucinates despite having the right documents in the corpus. It confidently invents error codes that don't exist. It returns semantically plausible answers that are completely wrong for your specific use case.

You check the prompt. You tune the temperature. You try a bigger model. Nothing fundamentally improves.

Here's what nobody tells you upfront: your RAG system isn't failing because of the LLM. It's failing because of retrieval.

The language model can only work with what you give it. Feed it the wrong context and no amount of prompting will save you. Feed it incomplete context and it'll fill the gaps with its training data, which is almost never what you want in a RAG system.

This article is about the retrieval decisions that actually matter when you're trying to keep a RAG system working in production—not on benchmarks, not in demos, but when real users are asking real questions and expecting correct answers. And because infrastructure costs matter (especially in India where GPU access isn't a given), I'll be honest about what things actually cost to run.

The Thing About RAG That Changes Everything

Traditional search and RAG retrieval solve fundamentally different problems, even though they use similar technology.

In traditional search, you're helping humans find documents. The user sees a list of results, evaluates them, clicks through, and decides what's relevant. You're optimizing for "useful links on the first page." If something isn't quite right, the human can ignore it and try the next result.

In RAG, you're constructing context for an LLM. The model doesn't "browse" your results. It doesn't skeptically evaluate sources. It treats whatever you put in the context window as ground truth and synthesizes an answer from it. If you give it three chunks where two are correct and one is subtly wrong, it'll often blend all three into an answer that sounds confident but contains the error.

This changes the entire game. In RAG:

Precision matters more than you think. A single irrelevant chunk doesn't just waste space—it can actively mislead the model. I've seen systems where adding the 6th retrieval result consistently made answers worse because it introduced conflicting information.

Recall failures are invisible. If you miss a critical chunk, the model doesn't say "I don't have enough information." It fills the gap with plausible-sounding content from its training data. Your user gets a confident answer. You get a silent failure that's hard to detect.

You can't rely on the model to filter. LLMs are optimized to be helpful and to synthesize information, not to critically evaluate source quality. They'll try to make sense of whatever context you give them, even when they shouldn't.

This is why most RAG failures trace back to retrieval, not model quality. You're not building a search engine. You're building a fact-assembly pipeline where mistakes compound.

When I Actually Use BM25 (And Why)

Let me tell you about a production incident that taught me when BM25 matters.

We had a customer support RAG system. Users would ask questions like "Error code E4392 keeps appearing, what does it mean?" Our dense retriever would return semantically similar error messages—things about configuration errors, timeout errors, connection errors. All plausible. None of them about E4392.

The problem? Our embedding model was treating "E4392" like noise. It's a weird token that doesn't appear in normal language. The model learned to focus on the semantic meaning around it ("error code", "appearing", "what does it mean") and basically ignored the actual identifier.

Meanwhile, we had a document that literally said "Error E4392: Database connection pool exhausted." BM25 would have found it instantly—exact term match, high IDF weight because that error code is rare. But we weren't using BM25 because "semantic search is better."

That's when I learned: BM25 isn't about semantic understanding. It's about not dropping facts on the floor.

Here's what BM25 actually does well in RAG:

Exact identifiers survive. Error codes, ticket IDs, API endpoints, function names, configuration keys—anything that needs to be precisely correct rather than semantically similar. If your corpus contains "endpoint /api/v2/users" and the query mentions that exact path, BM25 will find it. Dense retrieval might return "/api/v1/users" or "/api/v2/accounts" because they're semantically close.

Domain terminology is preserved. Legal documents use specific clause references. Medical records use precise condition names. Technical docs use exact protocol names. BM25 respects this specificity instead of smoothing it into general semantic categories.

Rare but critical terms get weight. The IDF (inverse document frequency) component of BM25 automatically upweights rare terms. If your query contains an unusual technical term that appears in only three documents, BM25 will prioritize those documents. Dense retrieval might diffuse that signal across dozens of semantically related but less relevant chunks.

You can debug it. When BM25 retrieves something, you can look at the query terms and see exactly why. There's no mystery about attention patterns or embedding spaces. This matters enormously when you're trying to fix production issues at 2 AM.
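
If you want to see that IDF effect in isolation, here's a minimal sketch using the rank_bm25 library on a toy corpus (not our production index; the tokenizer below is only good enough for a demo):

code
import re
from rank_bm25 import BM25Okapi

corpus = [
    "Error E4392: Database connection pool exhausted",
    "Common database errors and troubleshooting",
    "Database connection timeout errors",
]

def tokenize(text: str) -> list:
    # Toy tokenizer: lowercase and keep alphanumeric tokens so "E4392:" matches "e4392"
    return re.findall(r"[a-z0-9_]+", text.lower())

bm25 = BM25Okapi([tokenize(doc) for doc in corpus])
print(bm25.get_scores(tokenize("What causes E4392?")))
# The rare token "e4392" dominates the score, so the first document wins by a wide margin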

Here's a real example from our error code system. Query: "What causes E4392?"

BM25 scoring (simplified):

code
Document 1: "Error E4392: Database connection pool exhausted"- "E4392": high weight (rare term, exact match)- "Error": medium weight (common but present)- Score: 8.2Document 2: "Common database errors and troubleshooting"- "Error": medium weight- "database": low weight (common)- No "E4392"- Score: 2.1

BM25 returns Document 1 at the top. The LLM gets the exact error definition. Answer is correct.

Dense retrieval on the same query:

code
Document 1: "Error E4392: Database connection pool exhausted"- Embedding similarity: 0.71Document 2: "Database connection timeout errors"- Embedding similarity: 0.76 (higher!)Document 3: "Connection pool configuration guide"- Embedding similarity: 0.74

(Note: These similarity scores are illustrative examples based on typical model behavior. Actual scores depend on your embedding model and corpus.)

Dense retrieval returns Documents 2 and 3 first. They're semantically related to database errors and connections, but they don't contain the specific error code. The LLM tries to answer based on general connection issues. Answer is plausible but wrong.

That's the fundamental difference. BM25 asks "do the important words match?" Dense retrieval asks "does this feel related?"

For some queries, "feels related" is exactly what you want. For others, it's how you get confidently wrong answers.

When Dense Retrieval Actually Helps (And When It Doesn't)

Dense retrieval isn't overhyped—it solves real problems that BM25 can't. But it also introduces new failure modes that are easy to miss until you're in production.

Let me show you where it helps first. Same support system, different query: "How do I recover a deleted account?"

BM25 struggles here because the documentation says "account restoration procedure" not "recover deleted account." Zero term overlap on "recover" and "deleted." BM25 returns documentation about account creation, account settings, anything with "account" in it.

Dense retrieval embeds both the query and the documents into semantic space. It learns that "recover" and "restore" are similar. "Deleted" and "removed" are similar. "Account restoration procedure" matches the intent of "recover deleted account" even without shared terms.

This is powerful for natural language queries where users express the same intent in different ways. It's essential for conversational interfaces where follow-up questions lack explicit context ("What about retries?" after asking about API errors).
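
Here's a minimal sketch of that behavior with sentence-transformers (model choice and exact scores are illustrative; run it against your own corpus):

code
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I recover a deleted account?"
docs = [
    "Account restoration procedure",   # same intent, no overlap on "recover"/"deleted"
    "Creating a new account",          # shares "account" but different intent
]

scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)
# With typical general-purpose models, the restoration doc scores noticeably higher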

Here's where dense retrieval consistently helps in RAG:

Synonym and paraphrase handling. Users say "reset password," "forgot password," "recover account access," "can't log in" to mean roughly the same thing. Dense retrieval maps these to the same area of semantic space.

Intent-based matching. Questions like "Why does the system slow down at night?" don't contain the technical terms in your documentation ("batch processing," "scheduled maintenance," "cron jobs"), but dense retrieval can still connect them.

Exploratory queries. When users don't know the exact terminology, they describe problems in their own words. Dense retrieval gives them a fighting chance of finding relevant information.

But here's what the tutorials don't tell you about dense retrieval in production RAG:

It over-generalizes short queries. Query: "timeout". Dense retrieval returns chunks about network timeouts, database timeouts, API timeouts, authentication timeouts, session timeouts. All semantically similar to the concept of "timeout." But which one does the user actually need? You don't know, and now you're passing 5-10 chunks to the LLM hoping it figures it out.

It treats symbols as noise. Embedding models are trained on natural language. They've learned that weird tokens like "E4392" or "/api/v2/users" or "CONFIG_MAX_RETRIES=5" are less important than the readable text around them. This is exactly wrong for technical RAG systems.

It fails silently with confidence. When BM25 fails, you get obviously bad results—completely unrelated documents ranked high. When dense retrieval fails, you get subtly wrong results—documents that are topically related but operationally incorrect. These failures are harder to catch.

Here's a concrete example from our codebase RAG. Query: "How does the retry logic work in user authentication?"

Dense retrieval returns:

  1. Documentation about general retry patterns (semantically related ✓)
  2. Code showing retry logic for API calls (related but different component ✗)
  3. Authentication flow diagram (related but no retry details ✗)

The chunks all make sense individually. They're about retries, or about authentication, or about system architecture. But none of them contain the specific retry logic for authentication. The LLM synthesizes an answer that blends general retry patterns with authentication concepts. Sounds great, completely wrong for our actual implementation.

The correct chunk—buried at rank 8—contains the actual function: def authenticate_with_retry(user, max_attempts=3):. It has "retry" and "authentication" in close proximity. BM25 would have caught it. Dense retrieval ranked it low because the surrounding code wasn't semantically rich enough.

Why Hybrid Retrieval Isn't Optional Anymore

Here's the thing everyone learns eventually: you can't pick between BM25 and dense retrieval. You need both.

They fail in opposite ways:

BM25 is precise but brittle. It finds exactly what you ask for, using exactly the words you used. Great for technical queries with specific terminology. Terrible for natural language questions where phrasing varies.

Dense retrieval is robust but vague. It understands intent across different wordings. Great for exploratory questions. Terrible for maintaining hard constraints on identifiers and symbols.

I've watched multiple teams go through the same progression:

  1. Start with BM25 because it's simple and works
  2. Users complain it's "too literal" and can't handle natural language
  3. Switch to dense retrieval because it's "modern" and handles synonyms
  4. Users complain about wrong answers despite relevant docs existing
  5. Realize they need both and implement hybrid retrieval
  6. Wonder why they didn't just start with hybrid

The answer is always: you learn what each approach is bad at through painful production incidents.

Here's what hybrid retrieval actually looks like in our system:

code
def hybrid_retrieve(query: str, top_k: int = 5) -> List[Chunk]:
    # BM25 retrieval - enforces lexical constraints
    bm25_results = bm25_index.search(query, k=20)

    # Dense retrieval - captures semantic intent
    query_embedding = embed_model.encode(query)
    dense_results = vector_db.search(query_embedding, k=20)

    # Union: get everything either retriever thinks is relevant
    # This maximizes recall - we'd rather have extras than miss critical chunks
    all_candidates = merge_results(bm25_results, dense_results)

    # Fusion: combine scores
    # BM25 weight = 0.4, Dense weight = 0.6
    # These weights are domain-specific - tune on your data
    fused_scores = {}
    for chunk in all_candidates:
        bm25_score = bm25_results.get(chunk.id, 0)
        dense_score = dense_results.get(chunk.id, 0)
        fused_scores[chunk.id] = (0.4 * bm25_score + 0.6 * dense_score)

    # Top-K by fused score
    ranked_chunks = sorted(
        all_candidates,
        key=lambda c: fused_scores[c.id],
        reverse=True
    )[:top_k]

    return ranked_chunks

This is simplified, but it captures the key insight: BM25 makes sure you don't drop critical facts. Dense retrieval makes sure you don't miss user intent. Fusion balances them.

Performance characteristics (based on our testing):

  • BM25 search (50k chunks): 15-25ms on t3.medium
  • Dense search (50k vectors, 384-dim): 20-35ms on t3.medium with Qdrant
  • Fusion logic: <5ms
  • Total hybrid retrieval: 40-65ms

The weights (0.4/0.6 here) matter a lot and depend on your domain:

  • Technical documentation, code, logs: Bias toward BM25 (0.6/0.4 or even 0.7/0.3)
  • Customer FAQs, conversational support: Bias toward dense (0.3/0.7)
  • Mixed domains: Start at 0.5/0.5 and measure

How do you measure? Run your hybrid retriever on real queries from production and check:

  • Precision@5: Of the top 5 chunks, how many are actually relevant?
  • Recall@20: Of all relevant chunks in the corpus, how many appear in the top 20?
  • MRR (Mean Reciprocal Rank): On average, what position is the first relevant chunk?
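
A minimal sketch for computing these three metrics, assuming you've hand-labeled a small set of production queries with the chunk IDs that should have been retrieved (it reuses the hybrid_retrieve sketch from above):

code
from typing import Dict, List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    top = retrieved[:k]
    return sum(1 for chunk_id in top if chunk_id in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 20) -> float:
    top = retrieved[:k]
    return sum(1 for chunk_id in top if chunk_id in relevant) / max(len(relevant), 1)

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    for rank, chunk_id in enumerate(retrieved, 1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(labeled_queries: Dict[str, Set[str]]) -> None:
    # labeled_queries maps a real production query to the chunk IDs that should be retrieved
    p5, r20, mrr = [], [], []
    for query, relevant in labeled_queries.items():
        retrieved_ids = [chunk.id for chunk in hybrid_retrieve(query, top_k=20)]
        p5.append(precision_at_k(retrieved_ids, relevant, k=5))
        r20.append(recall_at_k(retrieved_ids, relevant, k=20))
        mrr.append(reciprocal_rank(retrieved_ids, relevant))
    print(f"Precision@5: {sum(p5) / len(p5):.3f}")
    print(f"Recall@20:   {sum(r20) / len(r20):.3f}")
    print(f"MRR:         {sum(mrr) / len(mrr):.3f}")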

But most importantly: manually inspect failures. When your RAG system gives a wrong answer, look at what got retrieved. Was the right chunk there but ranked low? Was it missing entirely? Did wrong chunks rank high? Each pattern tells you how to adjust.

The Reranker Is Where You Fix Everything

Even with hybrid retrieval, you'll have ordering problems. The right chunk is in your candidate set but it's at position 8, and your context window only fits the top 5. The LLM never sees it.

This is where rerankers matter. A reranker looks at your top-K candidates (typically 10-20) and re-scores them using a more expensive but more accurate model—usually a cross-encoder.

Here's the difference: your initial retrievers (BM25 and dense) encode the query and documents separately. They score based on term overlap or embedding similarity. A cross-encoder processes the query and each candidate together, letting attention mechanisms draw direct connections between them.
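
Concretely, a bi-encoder compares two independently computed embeddings, while a cross-encoder scores the (query, document) pair in a single forward pass. A minimal sketch using the models mentioned elsewhere in this article (the two scores are not on the same scale; this only shows the two paths):

code
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "What's the timeout for database queries?"
doc = "The database query timeout is set to 30 seconds by default"

# Bi-encoder: query and document embedded separately, then compared
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
bi_score = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(doc))

# Cross-encoder: query and document processed together, so attention can
# connect "timeout for database queries" directly to "database query timeout"
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')
cross_score = cross_encoder.predict([(query, doc)])[0]

print(float(bi_score), float(cross_score))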

Practical example from our system. Query: "What's the timeout for database queries?"

After hybrid retrieval, our top 5 candidates are:

  1. "Database connection timeout is configurable via CONFIG_DB_TIMEOUT" (hybrid score: 0.82)
  2. "Query performance optimization tips" (hybrid score: 0.79)
  3. "Default timeout values for system components" (hybrid score: 0.76)
  4. "The database query timeout is set to 30 seconds by default" (hybrid score: 0.74)
  5. "Troubleshooting slow database queries" (hybrid score: 0.71)

Chunk 4 has the exact answer. But it's ranked 4th because it uses different phrasing—"database query timeout" vs just "timeout for database queries." The word order is different and it doesn't mention "configurable," so both BM25 and dense retrieval scored it lower than chunk 1.

After reranking with a cross-encoder:

  1. "The database query timeout is set to 30 seconds by default" (rerank score: 0.94)
  2. "Database connection timeout is configurable via CONFIG_DB_TIMEOUT" (rerank score: 0.88)
  3. "Default timeout values for system components" (rerank score: 0.72)
  4. "Query performance optimization tips" (rerank score: 0.51)
  5. "Troubleshooting slow database queries" (rerank score: 0.45)

Now chunk 4 is at the top. The cross-encoder saw that "database query timeout" in the chunk directly matches the query's intent, even though the exact phrasing differs. The LLM gets the right context first.

Here's what reranking costs in our setup:

Latency (on moderate hardware):

  • Hybrid retrieval (BM25 + Dense): ~50ms
  • Reranking 10 candidates with ms-marco-MiniLM-L6-v2:
    • On GPU (T4, batched): ~50-80ms
    • On CPU (modern Xeon, batched): ~60-100ms
    • On CPU (sequential processing): ~120-200ms
  • Total pipeline: 130-250ms depending on hardware

Compute Costs:

If you go the GPU route (not necessary for most loads):

  • AWS g4dn.xlarge (T4 GPU, on-demand): ₹34,558/month (~$384/month)
  • AWS g4dn.xlarge (spot instances, can be interrupted): ₹10,000-18,000/month
  • Cross-encoder inference is 10-15x faster on GPU vs CPU

For CPU-only (more realistic for Indian developers):

  • AWS t3.xlarge (4 vCPU, 16GB RAM): ₹5,400/month
  • Handles embedding + BM25 + reranking on CPU
  • Latency: 150-250ms total
  • Works fine for <5,000 queries/day

Is it worth it?

Depends on your failure mode. We track "hallucination rate"—percentage of answers where the LLM invents facts not in retrieved context:

  • Before reranking: ~23%
  • After reranking: ~11%

(These numbers are from our customer support RAG system with ~1,200 manually evaluated query-answer pairs. Your mileage will vary based on domain, model, and query complexity.)

That's a massive improvement. Whether it's worth the extra latency (and cost) depends on your use case.

When reranking helps most:

Your corpus has near-duplicates or similar chunks. Reranker can distinguish subtle differences that initial retrievers miss.

Precision matters more than speed. If a wrong answer is costly (customer support, medical info, legal guidance), the extra latency is worth it.

You're already doing hybrid retrieval. Reranking on top of hybrid gives you diminishing returns if your initial retrieval is terrible—fix that first.

When you can skip reranking:

Your retrieval is already highly precise. If hybrid retrieval consistently puts the right chunk in the top 3, reranking won't help much.

Latency is critical. If you need sub-50ms responses, reranking might not fit your budget.

Your domain is simple. If most queries have obvious answers and few edge cases, simpler retrieval might be enough.

You're on a tight budget. CPU-based reranking adds 100-200ms. If you can't afford that latency, focus on improving hybrid retrieval weights instead.

Here's how to add reranking to the hybrid example:

code
from sentence_transformers import CrossEncoder

# Initialize once (loads ~90MB model into memory)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

def hybrid_retrieve_with_rerank(query: str, top_k: int = 5) -> List[Chunk]:
    # Step 1: Hybrid retrieval (BM25 + Dense)
    # Get 20 candidates to rerank
    candidates = hybrid_retrieve(query, top_k=20)

    # Step 2: Rerank using cross-encoder
    # Create pairs of (query, chunk_text)
    pairs = [(query, chunk.text) for chunk in candidates]

    # Score all pairs - this is the expensive part
    # On CPU (batched): ~60-100ms for 20 pairs
    # On GPU (batched): ~50-80ms for 20 pairs
    rerank_scores = reranker.predict(pairs)

    # Step 3: Sort by rerank score and take top K
    ranked = sorted(
        zip(candidates, rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]

    return [chunk for chunk, score in ranked]

Performance notes based on our testing:

For ms-marco-MiniLM-L6-v2 (22.7M parameters):

  • Reranking 10 candidates: 60-100ms on CPU (Intel Xeon), 50-80ms on T4 GPU
  • Reranking 20 candidates: 100-150ms on CPU, 80-120ms on GPU
  • Reranking 50 candidates: 200-300ms on CPU, 150-200ms on GPU

The latency scales roughly linearly with candidate count, which is why you limit reranking to top 10-20 results from hybrid retrieval.

One thing to watch: rerankers can over-fit to query phrasing. If your query is "How to configure X?" and a chunk says "Configuring X: ...", the reranker will score it very high even if the content isn't actually helpful. This is why you still need hybrid retrieval underneath—rerankers correct ordering mistakes, they don't fix missing constraints.

What Actually Breaks in Production (And How to Debug It)

Let me walk you through debugging a real RAG failure. User query: "Why is user authentication failing with error LDAP_AUTH_001?"

Step 1: Check what got retrieved

code
results = hybrid_retrieve_with_rerank(query, top_k=5)
for i, chunk in enumerate(results, 1):
    print(f"{i}. [{chunk.score:.3f}] {chunk.text[:100]}...")

Output:

code
1. [0.91] LDAP authentication troubleshooting guide. Common issues include...
2. [0.88] User authentication flow diagram. The system first checks...
3. [0.84] Error codes reference: LDAP_AUTH_002: Invalid credentials, LDAP_AUTH_003...
4. [0.79] Configuring LDAP connection parameters in config.yaml...
5. [0.76] Authentication service architecture overview...

Problem: None of these chunks mention LDAP_AUTH_001. Chunk 3 has other LDAP errors but not the one we need. The right document exists but wasn't retrieved.

Step 2: Check if the document exists

code
# Search directly in the corpus
grep_results = corpus.search_text("LDAP_AUTH_001")
if grep_results:
    print(f"Found {len(grep_results)} chunks with LDAP_AUTH_001")
    print(grep_results[0].text)

Output:

code
Found 1 chunk with LDAP_AUTH_001
"LDAP_AUTH_001: Connection timeout to LDAP server. Check network connectivity and server status."

So the document exists. Why wasn't it retrieved?

Step 3: Debug BM25

code
bm25_results = bm25_index.search("LDAP_AUTH_001", k=20)
for i, result in enumerate(bm25_results, 1):
    print(f"{i}. [{result.score:.3f}] {result.text[:80]}...")

Output:

code
1. [8.2] LDAP_AUTH_001: Connection timeout to LDAP server...
2. [2.1] LDAP authentication troubleshooting guide...

BM25 found it! Rank 1, high score (8.2). BM25 is working correctly.

Step 4: Debug Dense Retrieval

code
query_emb = embed_model.encode("LDAP_AUTH_001")
dense_results = vector_db.search(query_emb, k=20)

Output:

code
1. [0.76] LDAP authentication troubleshooting guide...
2. [0.71] User authentication flow diagram...
...
15. [0.43] LDAP_AUTH_001: Connection timeout to LDAP server...

There's the problem. Dense retrieval ranked the right chunk at position 15. Why?

The query is just "LDAP_AUTH_001"—a technical identifier with no natural language context. The embedding model treats it as noise and focuses on semantic similarity to "LDAP" and "authentication," which appear in many chunks. The specific error code doesn't carry enough semantic weight.

Step 5: Check fusion weights

Our fusion logic:

code
fused_score = 0.4 * bm25_score + 0.6 * dense_score

For the correct chunk:

  • BM25 score: 8.2 (normalized to ~0.95)
  • Dense score: 0.43
  • Fused: 0.4 * 0.95 + 0.6 * 0.43 = 0.38 + 0.26 = 0.64

For the wrong chunk (LDAP troubleshooting guide):

  • BM25 score: 2.1 (normalized to ~0.25)
  • Dense score: 0.76
  • Fused: 0.4 * 0.25 + 0.6 * 0.76 = 0.10 + 0.46 = 0.56

Wait, 0.64 > 0.56, so the correct chunk should rank higher. In practice it can still miss the final context, because candidates are cut before fusion: we take the top 20 from each retriever (with score thresholds on top), and a dense similarity of 0.43 at rank 15 sits right at the edge of that cut.
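
For reference, the "normalized to ~0.95" above assumes the unbounded BM25 scores get squashed into [0, 1] over the candidate set before fusion (cosine similarities already live in that range). A minimal min-max sketch, one common choice rather than the only one:

code
def min_max_normalize(scores: dict) -> dict:
    # Map one retriever's raw scores into [0, 1] within its candidate set,
    # so unbounded BM25 scores and cosine similarities become comparable
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}

# Usage: normalize each retriever's scores separately, then fuse
# fused = 0.4 * norm_bm25.get(chunk_id, 0) + 0.6 * norm_dense.get(chunk_id, 0)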

The fix: For technical queries with identifiers, increase BM25 weight or take more candidates before fusion.

code
# Detect technical query (contains uppercase alphanumeric patterns)
import re

is_technical = bool(re.search(r'[A-Z_]{3,}[0-9]+', query))

if is_technical:
    # Bias toward BM25, take more candidates
    bm25_weight = 0.7
    dense_weight = 0.3
    candidate_k = 30
else:
    bm25_weight = 0.4
    dense_weight = 0.6
    candidate_k = 20

After this fix, the correct chunk ranks in the top 3 consistently.

Debugging Checklist (What to Actually Check)

When your RAG system gives wrong answers, work through this systematically:

1. Did we retrieve the right document at all?

  • Search your corpus directly for key terms from the query
  • If the doc doesn't exist, this is a content problem, not retrieval
  • If it exists but wasn't retrieved, continue debugging

2. Where does BM25 rank it?

  • Run BM25-only retrieval on the query
  • Check if the right doc appears in top 10
  • If not: check tokenization, stopwords, field weighting

3. Where does dense retrieval rank it?

  • Run dense-only retrieval
  • If it's missing or ranked low, check:
    • Does the chunk have enough natural language context?
    • Are symbols/identifiers being treated as noise?
    • Is the chunk too short or too long?

4. How is fusion combining them?

  • Log both scores before fusion
  • Check if weight balance makes sense for this query type
  • Technical queries need more BM25 weight
  • Natural language queries need more dense weight

5. Is reranking fixing or breaking it?

  • Compare ranks before and after reranking
  • Check if reranker is over-fitting to query phrasing
  • Verify reranker sees full chunk context, not just snippet

6. What's actually in the context window?

  • Print exactly what you're sending to the LLM
  • Check for redundancy, contradictions, irrelevant chunks
  • Verify citations/metadata are preserved
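
For step 6 of the checklist, here's a minimal sketch of logging exactly what goes into the context window (the assemble_context helper is illustrative; adapt it to however you actually build your prompt):

code
import json
import logging

logger = logging.getLogger("rag.retrieval")

def assemble_context(query: str, top_k: int = 5) -> str:
    chunks = hybrid_retrieve_with_rerank(query, top_k=top_k)

    # Log what the LLM will actually see, with scores, so a wrong answer
    # can be traced back to the exact retrieved context that produced it
    logger.info(json.dumps({
        "query": query,
        "chunks": [
            {
                "id": chunk.id,
                "score": round(float(chunk.score), 3),
                "text": chunk.text[:200],
            }
            for chunk in chunks
        ],
    }))

    return "\n\n".join(chunk.text for chunk in chunks)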

Most RAG failures fall into one of these categories:

Type 1: Missing content - Document doesn't exist in corpus. Fix: add content, improve chunking.

Type 2: BM25 failure - Wrong tokenization, stopword removal, or field weights. Fix: adjust BM25 config.

Type 3: Dense failure - Embedding model doesn't understand domain or treats identifiers as noise. Fix: adjust fusion weights or try domain-specific embeddings.

Type 4: Fusion failure - Wrong balance between BM25 and dense. Fix: tune weights per query type.

Type 5: Reranking failure - Reranker over-fits or under-fits. Fix: try different reranker or skip it.

Type 6: Context assembly failure - Right chunks retrieved but presented poorly. Fix: improve ordering, deduplication, or formatting.

In our experience, 60% of failures are Type 3 (dense retrieval problems), 25% are Type 4 (fusion), and 15% are everything else.

What This Costs in India (Real Numbers)

Let's talk about what hybrid retrieval + reranking actually costs to run. Most tutorials skip this part or give you AWS's list prices without context. Here's what it really looks like if you're building in India.

Option 1: CPU-Only Setup (Most Realistic for Side Projects)

This is what I'd recommend for Indian developers starting out or running side projects:

code
For ~1,000-2,000 queries/day:

Embedding model (dense retrieval):
- sentence-transformers/all-MiniLM-L6-v2
- Runs on CPU fine for this load
- AWS t3.medium (2 vCPU, 4GB RAM): ₹2,700/month (~$30/month)
- Latency: 20-30ms per query

Vector database:
- Qdrant or Weaviate
- 50k chunks, 384-dim embeddings
- AWS t3.small (2 vCPU, 2GB RAM): ₹1,350/month (~$15/month)
- Memory: ~2GB for index

BM25 index:
- Elasticsearch or custom (Whoosh, BM25S)
- Same 50k chunks
- Can run on same t3.small as vector DB: ₹0 extra
- OR separate t3.small: ₹1,350/month

Reranker (optional but recommended):
- cross-encoder/ms-marco-MiniLM-L6-v2 on CPU
- Adds ~100-150ms latency
- Can run on same instance as embeddings: ₹0 extra
- OR scale to t3.large for better latency: ₹2,700/month extra

Total: ₹4,000-7,000/month (~$45-80/month)
End-to-end latency: 150-250ms

This is actually affordable. You can run a production RAG system for the cost of a nice dinner in Bangalore.

Option 2: GPU-Accelerated (When You Have Revenue)

If you're doing >10,000 queries/day and latency matters:

code
For ~10,000-20,000 queries/day:

Embedding model on GPU:
- AWS g4dn.xlarge (T4 GPU, 4 vCPU, 16GB RAM)
- On-demand: ₹34,558/month (~$384/month) ⚠️
- Spot instances (can be interrupted): ₹10,000-18,000/month
- Latency: 5-10ms per query

Vector database:
- Scale to t3.large (2 vCPU, 8GB RAM): ₹3,600/month
- OR managed service (Pinecone, Weaviate Cloud): ₹4,000-8,000/month

BM25 index:
- Elasticsearch on t3.medium: ₹2,200/month

Reranker:
- Runs on same GPU as embeddings: ₹0 extra
- Adds ~50-80ms latency (batched)

Total with on-demand GPU: ₹40,000-48,000/month (~$445-535/month)
Total with spot GPU: ₹16,000-32,000/month (~$180-355/month)
End-to-end latency: 50-100ms

Reality check: GPU instances are expensive. The g4dn.xlarge at ₹34,558/month is more than many fresher Indian developers' monthly salaries. Spot instances help (₹10,000-18,000/month) but can be interrupted with 2-minute warning.

Option 3: Hybrid Approach (Smart Scaling)

This is what we actually do in production:

code
CPU for routine queries:
- t3.xlarge (4 vCPU, 16GB RAM): ₹5,400/month
- Handles 80% of queries (simple, low-traffic periods)
- Latency: 150-200ms

GPU for peak traffic (via autoscaling):
- g4dn.xlarge spot instances
- Only spins up during high traffic
- Average ~30% uptime: ₹3,000-5,000/month
- Latency during peaks: 50-80ms

Total: ₹8,400-10,400/month (~$95-115/month)

This balances cost and performance. Most queries run on cheap CPU. GPU kicks in only when needed.

Comparison: Self-Hosted vs Managed Services

For the same 10,000 queries/day workload:

Self-hosted (CPU-only):

  • Infrastructure: ₹6,000-9,000/month
  • Control: Full
  • Latency: 150-250ms

Self-hosted (GPU spot):

  • Infrastructure: ₹16,000-32,000/month
  • Control: Full, but spot can be interrupted
  • Latency: 50-100ms

OpenAI Embeddings + Pinecone:

  • OpenAI embeddings: ~₹900/month (10k queries × 1k tokens × $0.0001/1k)
  • Pinecone: ₹4,000/month (50k vectors)
  • Total: ₹4,900/month
  • Control: Limited
  • Latency: 80-150ms (network dependent)

Verdict: For side projects or early-stage products in India, managed services or CPU-only self-hosting make more sense. GPU infrastructure is a luxury you buy when you have revenue, not when you're experimenting.

What We Actually Run

Our production setup for a customer support RAG (moderate traffic, ~8,000 queries/day):

  • AWS t3.xlarge for CPU inference: ₹5,400/month
  • Elasticsearch t3.small for BM25: ₹1,350/month
  • Qdrant (self-hosted) on same t3.xlarge: ₹0 extra
  • Reranker on CPU: ₹0 extra (same instance)
  • Autoscaling g4dn.xlarge spot for peak hours: ~₹4,000/month average

Total: ₹10,750/month (~$120/month)

End-to-end latency:

  • Off-peak (CPU): 180-220ms
  • Peak (GPU): 60-90ms

This works because:

  1. Most customer support queries aren't latency-critical (users will wait 200ms)
  2. Peak traffic is predictable (business hours in India)
  3. Spot instances are reliable enough for our SLA
  4. We'd rather spend money on content quality than GPU cycles

The India Reality

Here's what most tutorials won't tell you:

GPU access is not a given. A g4dn.xlarge at ₹34,558/month is expensive for Indian startups. That's real money when your pricing is in INR but infrastructure is in USD.

CPU-first is valid. You don't need GPU for RAG to work. It's a latency optimization, not a correctness requirement. Start with CPU, measure actual latency requirements, scale only when needed.

Spot instances are your friend. If you do need GPU, spot instances cost 60-70% less. Yes, they can be interrupted. No, it's not usually a problem if you handle failures gracefully and have CPU fallback.

Managed services are competitive. When you factor in engineering time, Pinecone + OpenAI embeddings is often cheaper than self-hosting GPU infrastructure—especially at small scale.

Don't optimize prematurely. Build with CPU. Measure actual query load and latency requirements. Scale to GPU only when you have data showing it's necessary. Most RAG systems never need GPU inference.

Choosing Embedding Models (What Actually Matters)

Every tutorial tells you to use the "best" embedding model from MTEB leaderboard. In production, the best model is the one that works for your data, not the one with the highest benchmark score.

Here's what actually matters:

1. Does it handle your domain's symbols?

Test it:

code
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Test with technical identifiers
test_queries = [
    "LDAP_AUTH_001",
    "Error code E4392",
    "/api/v2/users/authenticate",
    "CONFIG_MAX_RETRIES",
]

test_docs = [
    "LDAP_AUTH_001: Connection timeout",
    "LDAP_AUTH_002: Invalid credentials",
    "Error code E4392 indicates connection pool exhausted",
    "API endpoint /api/v2/users/authenticate handles login",
]

for query in test_queries:
    query_emb = model.encode(query)
    doc_embs = model.encode(test_docs)

    # Compute similarities
    sims = cosine_similarity([query_emb], doc_embs)[0]

    print(f"\nQuery: {query}")
    for doc, sim in zip(test_docs, sims):
        print(f"  {sim:.3f} | {doc[:50]}")

If your exact-match docs don't score highest, your embedding model is dropping symbols on the floor. This is common with general-purpose models.

2. Does it preserve short-chunk quality?

Many embeddings degrade on chunks shorter than 50 tokens. Test with your actual chunk sizes.
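
A quick way to check is to embed the same fact as a terse chunk and as a fuller sentence and compare how each scores against a realistic query. A minimal sketch (the example strings are made up):

code
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "What is the default database query timeout?"
# Same fact, expressed as a terse config-style chunk vs a full sentence
short_chunk = "CONFIG_DB_TIMEOUT: 30"
long_chunk = "The database query timeout is controlled by CONFIG_DB_TIMEOUT and defaults to 30 seconds."

print(util.cos_sim(model.encode(query), model.encode([short_chunk, long_chunk])))
# If the short chunk's score collapses, your chunks need more surrounding context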

3. Can you run it on your infrastructure?

Model size matters:

  • all-MiniLM-L6-v2: 80MB, 384 dims, CPU-friendly
  • all-mpnet-base-v2: 420MB, 768 dims, needs more RAM
  • instructor-large: 1.3GB, 768 dims, GPU recommended

Don't pick a model you can't actually deploy.

4. What's the latency in production?

Benchmark on your hardware:

code
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
test_queries = ["example query"] * 100

start = time.time()
embeddings = model.encode(test_queries, batch_size=32)
elapsed = time.time() - start

print(f"Avg latency: {elapsed/100*1000:.1f}ms per query")

Expected results (approximate):

  • On AWS t3.medium (2 vCPU): 25-35ms per query
  • On AWS t3.xlarge (4 vCPU): 15-25ms per query
  • On g4dn.xlarge (T4 GPU): 5-10ms per query

If it's >100ms on CPU, you need GPU or a smaller model. But remember: 25-30ms embedding latency is acceptable for most RAG systems. Don't prematurely optimize.

Domain-specific vs general models:

General models (SBERT, MiniLM): good starting point, handle diverse queries reasonably well.

Domain-specific (legal-bert, bio-bert, etc.): higher precision in narrow domains but worse generalization.

For most RAG systems, start with general models. Switch to domain-specific only if:

  • You have strong evidence general models are failing
  • Your entire corpus is in one domain
  • You can afford maintaining domain-specific infrastructure

Final Advice for RAG Engineers

Start with hybrid retrieval, not dense-only. The extra complexity is worth it. You'll end up adding BM25 later anyway after the first major hallucination incident.

Don't trust benchmark scores. Your retrieval quality is determined by your queries and your corpus, not MTEB leaderboard rankings. Measure on your data.

Debug retrieval before touching prompts. When answers are wrong, 80% of the time it's because the right context wasn't retrieved. Fix that first.

Make retrieval observable. Log what was retrieved, why it was ranked that way, and what the LLM saw. You can't improve what you can't measure.

Treat reranking as precision tuning. If your initial retrieval is broken, reranking won't save you. Get hybrid retrieval working first, then add reranking for the last 10-20% quality improvement.

Know your failure modes. BM25 fails on paraphrases. Dense fails on symbols. Fusion fails on wrong weights. Reranking fails on short chunks. Each one breaks differently, and you need to recognize the patterns.

The best RAG systems aren't built by people who use the fanciest models. They're built by people who understand exactly how their retrieval breaks and have debugging workflows to catch it early.


Related articles: Searching / Indexing / RAG Series

  1. BM25-Based Searching: A Developer's Comprehensive Guide
  2. BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production (This Article)
  3. Building a Full-Stack Hybrid Search System (BM25 + Vectors + Cross-Encoders) with Docker
  4. Hands-on Tutorial: Fine-tune a Cross-Encoder for Semantic Similarity