HNSW Vector Search Recall Failures in Production

Your RAG system's HNSW (Hierarchical Navigable Small World) vector search ships with 0.95 recall@10 in testing. Three months later, your users are complaining the assistant "doesn't know anything." You check latency - fine. You check error rates - zero. You re-chunk the corpus, swap the embedding model, and open five GitHub issues. Nothing helps. You were debugging the wrong layer.

The embedding is not the problem. The index is.

Specifically: the index was tuned to a query distribution that does not exist in production. Your benchmarks tested common, well-represented queries. Your users are asking the long-tail ones. And HNSW - which your vector DB ships as the default, which ann-benchmarks.com shows dominating the leaderboard - degrades precisely on underrepresented neighborhoods in the graph.

This is the Index-Access Pattern Mismatch: when the algorithm that wins on benchmark datasets fails on your production query distribution. It is the most common silent failure mode in production vector search, and almost nobody monitors for it.

Why Every Team Ships HNSW Without Measuring Recall

The default path to vector search in 2026 looks like this: spin up Pinecone, Weaviate, Qdrant, pgvector, or Milvus; insert your embeddings; get search results. Most of these systems default to HNSW or offer it as the go-to index. You never chose it. You never tuned it. You never measured whether it fits your workload.

HNSW deserves its reputation. Malkov and Yashunin's algorithm builds a hierarchical graph where each vector connects to its approximate nearest neighbors across multiple layers. Top layers carry long-range edges for fast traversal; bottom layers carry short-range edges for local refinement. The result is logarithmic search complexity with recall consistently above 95% out of the box on standard benchmarks.

The problem is what "standard benchmarks" means.

Ann-benchmarks.com - the canonical reference for ANN algorithm comparison - tests on datasets like glove-100-angular, SIFT1M, and DEEP1B. These datasets have relatively uniform query distributions: the benchmark queries are drawn from the same distribution as the indexed vectors. Your RAG system does not work this way. Your corpus is non-uniform - some topics appear in hundreds of chunks, others in one. Your queries are non-uniform - some questions are common and well-represented in the embedding space, others are rare, specific, or cross-domain.

HNSW's greedy graph traversal is fast precisely because it short-circuits. When the query falls near a dense cluster, the graph routes there efficiently. When the query falls near a sparse region - a long-tail topic with few indexed neighbors - the greedy traversal gets stranded in the nearest dense cluster instead. The algorithm "succeeds" by returning a result. The result has high cosine similarity. The result is wrong.

The Towards Data Science controlled experiment (January 2026) documented this precisely: for fixed ef_search values, HNSW recall degrades faster than flat search as corpus size grows. The failure is not sudden - it is a gradient that starts around 200k vectors and compounds. By 1M vectors, the gap between HNSW recall and exact search widens materially on the queries that matter most.

The ANN Benchmark Trap: Why glove-100 Results Don't Transfer to Production

Before diagnosing your own system, it is worth understanding why the benchmarks are structurally misleading.

ScaNN is the clearest example. Google published the ICML 2020 paper "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" and demonstrated that on the 2020 glove-100-angular benchmark, running single-threaded on x86, ScaNN handled roughly twice as many queries per second as the next-fastest library at equivalent accuracy. This is real. The paper is sound. The performance improvement is genuine.

The problem is what ScaNN optimizes for: Maximum Inner Product Search (MIPS) on x86 processors with AVX support. Anisotropic vector quantization works by penalizing quantization errors that are parallel to the database vectors more heavily than orthogonal errors - a principled insight that aligns the compression objective with the actual retrieval goal. This matters a lot for inner product distance. It matters less for L2 distance. And it requires x86/AVX hardware.
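The weighting idea can be sketched in a few lines of plain Python. This is an illustration of the loss shape only, not ScaNN's API: the function names and the fixed weights `h_parallel` and `h_orthogonal` are assumptions made for the example (the paper derives its weights rather than hard-coding them).

```python
def split_residual(x, x_quantized):
    """Split the residual r = x - x_quantized into the component
    parallel to x and the component orthogonal to x."""
    r = [a - b for a, b in zip(x, x_quantized)]
    norm_sq = sum(a * a for a in x)
    scale = sum(a * b for a, b in zip(r, x)) / norm_sq  # projection coefficient
    r_parallel = [scale * a for a in x]
    r_orthogonal = [a - b for a, b in zip(r, r_parallel)]
    return r_parallel, r_orthogonal


def anisotropic_loss(x, x_quantized, h_parallel=4.0, h_orthogonal=1.0):
    """Weighted squared error: parallel error costs more than orthogonal error.
    The weights are illustrative assumptions."""
    r_parallel, r_orthogonal = split_residual(x, x_quantized)
    return (h_parallel * sum(a * a for a in r_parallel)
            + h_orthogonal * sum(a * a for a in r_orthogonal))


x = [1.0, 0.0]
# Two quantized versions, both exactly 0.1 away from x in Euclidean distance.
loss_parallel_error = anisotropic_loss(x, [0.9, 0.0])    # error along x
loss_orthogonal_error = anisotropic_loss(x, [1.0, 0.1])  # error across x
print(loss_parallel_error > loss_orthogonal_error)       # True
```

Both candidates have the same L2 error, yet the one that shifts the vector along its own direction is penalized four times as much, because that is the direction that changes inner products with well-aligned queries the most.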

If your production system runs on ARM instances (AWS Graviton, Apple Silicon), ScaNN's benchmark performance does not translate. If your distance metric is L2 rather than inner product, the anisotropic advantage shrinks. If your query distribution is not glove-100-angular, you are extrapolating.

The Benchmark Trap is treating leaderboard position as a proxy for fit. The right question is never "which algorithm wins on ann-benchmarks.com?" It is "which algorithm wins on my corpus, my query distribution, my hardware, my recall target, and my latency budget?"

Nobody asks that question at prototype stage. By production stage, the index is already deployed and the failure is already silent.

FAISS, HNSW, IVF-PQ, ScaNN, and DiskANN: What Each Index Actually Trades Off

Before getting to the right selection framework, the mechanics need to be clear. There are three index families, and they fail in different ways.

Graph-based: HNSW and Vamana

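HNSW (HNSWlib, pgvector, Weaviate, Qdrant, Milvus) builds a multi-layer proximity graph. Memory cost is significant: for 768-dimensional float32 vectors with M=40 connections, each vector costs approximately 3,232 bytes (3,072 bytes of vector data plus 160 bytes for 40 four-byte neighbor IDs). That is about 3.2 GB per million vectors before upper-layer links and allocator overhead; a planning figure of about 4.8 GB per million builds in roughly 1.5x headroom. With 32 edges per vector across layers, at 4 to 8 bytes per edge, the graph alone adds 128-256 bytes per vector. At 100M vectors, the graph overhead is 12-25 GB before you count the vectors themselves.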
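The arithmetic behind these numbers is easy to rerun for your own dimensions. A minimal sketch, assuming four-byte floats and four-byte neighbor IDs; the helper names are made up for this example, and the formula deliberately ignores upper-layer links and allocator overhead, which is why a planning number should sit above what it prints.

```python
def hnsw_bytes_per_vector(dims, m, bytes_per_float=4, bytes_per_edge=4):
    """Vector payload plus M neighbor IDs (no upper layers, no allocator overhead)."""
    return dims * bytes_per_float + m * bytes_per_edge


def gb_per_million(bytes_per_vector):
    return bytes_per_vector * 1_000_000 / 1e9


per_vector = hnsw_bytes_per_vector(dims=768, m=40)
print(per_vector)                  # 3232
print(gb_per_million(per_vector))  # 3.232

# Graph-only overhead at 100M vectors with 32 edges of 4 or 8 bytes each.
edge_overhead_gb = (100_000_000 * 32 * 4 / 1e9, 100_000_000 * 32 * 8 / 1e9)
print(edge_overhead_gb)            # (12.8, 25.6)
```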

The recall-latency tradeoff is controlled by ef_search. Higher ef_search means the algorithm considers more candidates before returning results, improving recall at the cost of latency. The Redis production benchmarks (March 2026) measured this directly: going from 0.8 to 0.95 recall increases HNSW latency by roughly 31%. That is manageable. The problem is what happens at 0.95 to 0.99 recall - the curve steepens sharply and latency can grow 3-5x.
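To see why a larger candidate list helps, here is a toy single-layer version of the best-first search that HNSW runs on its bottom layer. It is a simplified sketch, not any library's implementation: the five-node graph, the 1-D coordinates, and the node names are invented so that the true nearest neighbor sits behind a "bridge" node that looks unpromising from the entry point.

```python
import heapq


def search_layer(graph, coords, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results."""
    def dist(node):
        return abs(coords[node] - query)

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap: closest candidate first
    results = [(-dist(entry), entry)]    # max-heap via negated distance
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0]:         # best candidate is worse than worst kept result
            break
        for n in graph[c]:
            if n in visited:
                continue
            visited.add(n)
            if len(results) < ef or dist(n) < -results[0][0]:
                heapq.heappush(candidates, (dist(n), n))
                heapq.heappush(results, (-dist(n), n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return [n for _, n in sorted((-neg_d, n) for neg_d, n in results)]


coords = {"entry": 0.0, "bridge": 2.0, "dense_a": 6.0, "dense_b": 7.0, "tail": 9.5}
graph = {
    "entry": ["dense_a", "bridge"],
    "dense_a": ["entry", "dense_b"],
    "dense_b": ["dense_a"],
    "bridge": ["entry", "tail"],
    "tail": ["bridge"],
}
query = 10.0  # the true nearest neighbor is "tail"

print(search_layer(graph, coords, query, "entry", ef=1))  # ['dense_b']
print(search_layer(graph, coords, query, "entry", ef=3))  # ['tail', 'dense_b', 'dense_a']
```

With `ef=1` the search commits to the dense cluster and never expands the bridge. With `ef=3` the bridge stays in the candidate list long enough to be expanded, and the tail node is found. Each extra candidate kept costs extra distance computations, and that is where the latency goes.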

DiskANN (Microsoft, now in SQL Server 2025) uses the Vamana graph algorithm but stores the full index on NVMe SSD with compressed vectors cached in RAM. The result: 1 billion vectors at 95% recall with 5ms latency on a single node, versus HNSW's practical ceiling of 100-200M vectors for comparable performance. The memory footprint drops by roughly 90% compared to in-memory HNSW. The cost is SSD I/O latency - DiskANN is slower than in-memory HNSW for small corpora but the crossover point arrives well before a billion vectors.

Cluster-based: IVF and IVF-PQ

IVF (Inverted File Index) partitions your vectors into k-means clusters. At query time, you probe the nprobe nearest cluster centroids and search only within those clusters. The recall-latency knob is nprobe: higher means more clusters searched, better recall, more latency.
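A toy 1-D version makes the nprobe knob concrete. This is a sketch under stated assumptions: the centroids are given rather than trained with k-means, the data is five hand-picked numbers, and the function names are illustrative rather than taken from any library.

```python
def build_ivf(vectors, centroids):
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: abs(query - centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in lists[i]]
    return sorted(candidates, key=lambda vid: abs(vectors[vid] - query))[:k]


vectors = [1.0, 2.0, 4.9, 5.2, 9.0]
centroids = [0.0, 10.0]  # assumed to be already trained
lists = build_ivf(vectors, centroids)

# The query sits near the cluster boundary; its true top-2 is IDs 3 and 2.
print(ivf_search(5.1, vectors, centroids, lists, k=2, nprobe=1))  # [3, 4]
print(ivf_search(5.1, vectors, centroids, lists, k=2, nprobe=2))  # [3, 2]
```

With `nprobe=1` the neighbor at 4.9 is invisible because it was assigned to the other cluster. Probing every list reduces to exact search, which is the upper end of the recall-latency curve.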

IVF-PQ adds Product Quantization on top: each vector is divided into M sub-vectors and each sub-vector is quantized to one of K centroids. The result is aggressive compression - a 10M vector dataset that would need 38 GB as raw float32 vectors fits in approximately 3.5 GB with IVF-PQ. The memory reduction is real. The cost is quantization error: the compressed vectors introduce approximation noise that sits on top of the approximation already introduced by the IVF clustering.
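The size of a PQ code follows directly from the two parameters. A minimal sketch of that arithmetic; the 768-dimension, 96-sub-vector, 256-centroid configuration is an assumption chosen for the example, not a setting quoted above, and real indexes add vector IDs and inverted-list overhead on top of the code bytes.

```python
import math


def pq_bytes_per_vector(num_subvectors, centroids_per_subvector):
    """Each sub-vector is stored as the index of its nearest centroid."""
    bits = num_subvectors * math.log2(centroids_per_subvector)
    return math.ceil(bits / 8)


def raw_bytes_per_vector(dims, bytes_per_float=4):
    return dims * bytes_per_float


code_bytes = pq_bytes_per_vector(num_subvectors=96, centroids_per_subvector=256)
raw_bytes = raw_bytes_per_vector(dims=768)
print(code_bytes)               # 96
print(raw_bytes // code_bytes)  # 32
```

The 32x here is the code-only ratio. End-to-end compression is lower once IDs and index structures are counted, so size against measured memory rather than against this formula alone.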

IVF has one operational gotcha that bites teams repeatedly: the cluster centroids are static. You can append vectors to a trained IVF index, but each new vector is assigned to the existing centroids; as the corpus drifts away from the data the centroids were trained on, the clusters become unbalanced and recall at a fixed nprobe drops until you retrain and rebuild. For RAG systems with frequently updated knowledge bases, this means either accepting stale index states or scheduling expensive rebuilds. FAISS indexed 1.5 trillion vectors for Meta's internal applications - but Meta has the infrastructure to manage periodic rebuilds. Most teams do not.

Quantization-based: ScaNN

ScaNN is its own category: it combines space partitioning for candidate pruning with anisotropic quantization for fast distance estimation. The benchmark wins are real at scale on MIPS workloads. But ScaNN's operational profile is research-grade: x86/AVX only, TensorFlow and NumPy APIs, optimized for batch throughput rather than single-query latency. Spotify Voyager, which serves 1+ billion music embeddings at 1M QPS in production, uses HNSW with memory-mapped indices and SIMD acceleration rather than ScaNN - a strong signal about what "production-ready" means at scale.

The Index-Access Pattern Mismatch in Production

Here is the failure mode in concrete terms.

```python
# Wrong: default HNSW shipped from prototype
import chromadb

client = chromadb.Client()
collection = client.create_collection("knowledge_base")

# docs, embeddings, ids, and embed() are placeholders for your own corpus
# and embedding function.
# 500k documents inserted, HNSW built with defaults
# ef_construction=100, M=16 - whatever the library shipped
collection.add(documents=docs, embeddings=embeddings, ids=ids)

# Works fine in testing - queries are common topics
result = collection.query(
    query_embeddings=[embed("What is machine learning?")],
    n_results=10
)
# recall@10: 0.94. Looks great.

# Fails in production - rare, specific, long-tail query
result = collection.query(
    query_embeddings=[embed("What is the depreciation schedule for AWS Lambda provisioned concurrency costs under Reserved Capacity pricing in us-east-1?")],
    n_results=10
)
# recall@10: 0.41. Silent. No error. High cosine similarity on wrong results.
```

The failure is invisible in your monitoring because:

The query returns results - no exception, no timeout
Cosine similarity scores look healthy (0.82+)
Latency is within SLO
Average recall across all queries is still 0.91 - the long-tail queries are a small fraction of traffic
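The last point is plain arithmetic: a traffic-weighted average barely moves when a small bucket collapses. A short sketch with illustrative per-bucket recall values and an assumed 80/15/5 traffic split; both dictionaries are made-up inputs, not measurements.

```python
def traffic_weighted_recall(bucket_recall, traffic_share):
    """Overall recall as the dashboard sees it: weighted by query volume."""
    return sum(bucket_recall[b] * traffic_share[b] for b in bucket_recall)


bucket_recall = {"head": 0.94, "torso": 0.87, "tail": 0.43}  # illustrative values
traffic_share = {"head": 0.80, "torso": 0.15, "tail": 0.05}  # assumed traffic mix

overall = traffic_weighted_recall(bucket_recall, traffic_share)
print(round(overall, 3))  # 0.904
```

An overall figure around 0.90 passes most dashboards even though the tail bucket is returning the right neighbors less than half the time.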

What you are missing is Recall Horizon: the corpus size threshold at which HNSW recall starts drifting faster than flat search for fixed ef_search. This threshold is not a property of the algorithm in isolation - it is a function of your specific corpus density distribution. For a knowledge base where coverage is uniform, the Recall Horizon may not matter until 1M+ vectors. For a knowledge base where some topics have 500 chunks and others have 2, the Recall Horizon may arrive at 100k vectors.

The correct diagnostic is not average recall. It is per-query-type recall, segmented by query frequency bucket.

```python
# Right: instrument recall by query type before it breaks in production
import numpy as np
import faiss


def measure_recall_by_frequency_bucket(
    query_embeds: list[np.ndarray],   # pre-computed float32 query embeddings
    query_freq_labels: list[str],     # "head", "torso", "tail" for each query
    hnsw_index: faiss.IndexHNSWFlat,  # the FAISS HNSW index under test
    exact_index: faiss.IndexFlatIP,   # exact oracle over the SAME vectors, same insertion order
    k: int = 10,
) -> dict[str, float]:
    """
    Compare HNSW recall vs exact search, segmented by query frequency bucket.
    Head: common, well-represented topics. Tail: rare, specific, long-tail.
    Both indexes must have been built with L2-normalized vectors (cosine via IP).
    FAISS IDs are insertion positions, so the overlap below is only meaningful
    when both indexes hold the same vectors in the same order.
    """
    results: dict[str, list[float]] = {"head": [], "torso": [], "tail": []}
    for q_embed, label in zip(query_embeds, query_freq_labels):
        q = q_embed.reshape(1, -1).astype(np.float32)  # astype copies, caller's array is untouched
        faiss.normalize_L2(q)
        # Ground truth: exact nearest neighbors
        _, exact_ids = exact_index.search(q, k)
        # Approximate: HNSW neighbors
        _, hnsw_ids = hnsw_index.search(q, k)
        recall = len(set(exact_ids[0]) & set(hnsw_ids[0])) / k
        results.setdefault(label, []).append(recall)
    return {
        bucket: float(np.mean(recalls))
        for bucket, recalls in results.items()
        if recalls
    }

# Illustrative output for a 500k-doc corpus with default HNSW ef_search=16:
# {"head": 0.94, "torso": 0.87, "tail": 0.43}
# The unweighted mean is 0.75. Weighted by traffic, where tail queries are a
# small share, the average sits much higher and masks the tail collapse.
```

Here is the unified right-way setup - index selection, launch validation, and production monitoring as a single coherent pattern:

```python
# Right: full production vector search setup
import numpy as np
import faiss
from collections import defaultdict


# --- Step 1: select your index based on workload, not benchmarks ---
def select_vector_index(
    corpus_size: int,
    update_freq: str,            # "static", "daily", "realtime"
    recall_target: float,        # 0.90, 0.95, 0.99
    filter_selectivity: float,   # fraction of corpus REMAINING after typical filter
    memory_budget_gb: float,
    latency_budget_ms: float,
    embedding_dims: int = 1536,  # adjust for your model
) -> dict:
    raw_gb_per_1m = (embedding_dims * 4 * 1_000_000) / 1e9
    if corpus_size > 500_000_000:
        return {"index": "DiskANN", "rationale": "Billion-scale, 95% recall @ 5ms on NVMe"}
    if memory_budget_gb < raw_gb_per_1m * (corpus_size / 1_000_000) * 0.5:
        return {"index": "IVF-PQ", "rationale": "10x memory compression; plan periodic rebuilds"}
    if filter_selectivity < 0.1 and corpus_size > 100_000:
        return {"index": "Qdrant ACORN or Milvus Cardinal", "rationale": "HNSW disconnects under high-selectivity filters"}
    return {"index": "HNSW", "rationale": "Default for under 50M vectors with active writes"}


# --- Step 2: validate tail recall before launch ---
def validate_launch_recall(
    head_queries: list[np.ndarray],   # common query embeddings, float32
    tail_queries: list[np.ndarray],   # rare/long-tail query embeddings, float32
    hnsw_index: faiss.IndexHNSWFlat,  # HNSW index under test, built with production parameters
    sample_embeddings: np.ndarray,    # the vectors in hnsw_index, same insertion order, float32
    k: int = 10,
    tail_recall_floor: float = 0.85,
) -> dict:
    # FAISS IDs are insertion positions, so the exact oracle must hold the same
    # vectors in the same order as hnsw_index or the ID overlap is meaningless.
    exact = faiss.IndexFlatIP(sample_embeddings.shape[1])
    normed = sample_embeddings.copy().astype(np.float32)  # normalize a copy, not the caller's array
    faiss.normalize_L2(normed)
    exact.add(normed)

    def recall_for(queries):
        recalls = []
        for q in queries:
            q_norm = q.reshape(1, -1).copy().astype(np.float32)
            faiss.normalize_L2(q_norm)
            _, exact_ids = exact.search(q_norm, k)
            _, hnsw_ids = hnsw_index.search(q_norm, k)
            recalls.append(len(set(exact_ids[0]) & set(hnsw_ids[0])) / k)
        return float(np.mean(recalls))

    head_recall = recall_for(head_queries)
    tail_recall = recall_for(tail_queries)
    return {
        "head_recall": head_recall,
        "tail_recall": tail_recall,
        "launch_blocked": tail_recall < tail_recall_floor,
        "message": (
            f"BLOCK: tail recall {tail_recall:.2f} below floor {tail_recall_floor}"
            if tail_recall < tail_recall_floor
            else f"PASS: tail recall {tail_recall:.2f}"
        ),
    }


# --- Step 3: continuous production recall monitoring ---
class RecallMonitor:
    def __init__(self, sample_embeddings: np.ndarray, sample_ids: list[str]):
        """
        Exact-search oracle over a corpus sample.
        sample_embeddings: shape (n, dim), float32; normalized on a copy.
        sample_ids: document ID for each row of sample_embeddings.
        Results passed to log_query must come from an ANN index over the same
        vectors (for example a shadow HNSW index built with production
        parameters); neighbors outside the sample would otherwise count as misses.
        """
        self.dim = sample_embeddings.shape[1]
        self.exact_index = faiss.IndexFlatIP(self.dim)
        normed = np.array(sample_embeddings, dtype=np.float32, order="C")  # copy
        faiss.normalize_L2(normed)  # normalize before adding
        self.exact_index.add(normed)
        self.sample_ids = sample_ids
        self.recall_log: dict[str, list[float]] = defaultdict(list)

    def log_query(
        self,
        query_embed: np.ndarray,
        hnsw_result_ids: list[str],
        query_label: str,  # "head", "torso", "tail"
        k: int = 10,
    ) -> float:
        q = query_embed.reshape(1, -1).astype(np.float32)
        faiss.normalize_L2(q)
        _, exact_indices = self.exact_index.search(q, k)
        # FAISS pads with -1 when fewer than k neighbors exist; skip those slots.
        exact_ids = {self.sample_ids[i] for i in exact_indices[0] if i >= 0}
        recall = len(set(hnsw_result_ids) & exact_ids) / k
        self.recall_log[query_label].append(recall)
        return recall

    def report(self) -> dict[str, float]:
        return {
            label: float(np.mean(recalls))
            for label, recalls in self.recall_log.items()
            if recalls
        }
```

The pattern is: select by workload → validate tail recall before launch → monitor tail recall in production. All three steps together. Skipping any one of them is how the Index-Access Pattern Mismatch gets through to users.

The head queries make the average look acceptable. The tail queries - which are often the highest-value queries in a knowledge management RAG system - are failing at rates that would be unacceptable if they were visible.

How to Choose Your ANN Index: A Production Selection Framework

The wrong way to choose an index is to benchmark on ann-benchmarks.com and pick the winner.

The right way is to characterize your workload on four dimensions before writing a line of indexing code:

1. Corpus size and update frequency

For corpora under 5M vectors with active writes: HNSW. It absorbs inserts without rebuilds, recall is high out-of-box, and the memory overhead is manageable. For corpora under 50M vectors on PostgreSQL: pgvector with HNSW (parallel HNSW index builds, available since the 0.6 release, make this practical). For corpora over 100M vectors, mostly static: IVF-PQ. The memory compression is necessary and the static nature of the index is acceptable. For corpora approaching 1B vectors: DiskANN. It is the only algorithm that hits 95% recall at 5ms latency at that scale without requiring a GPU cluster.

2. Query distribution shape

Sample your production queries - or planned query types - before building the index. Cluster them. Measure how many chunks in your corpus are semantically near each cluster. If the distribution is heavy-tailed (few common topics, many rare ones), increase ef_search well above default and run tail recall separately. If more than 30% of your expected query volume is long-tail, consider whether HNSW defaults will hold at your target corpus size.
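One rough way to put numbers on "how many chunks are semantically near" is to count corpus vectors above a cosine-similarity threshold for each sampled query and bucket on that count. A minimal sketch in plain Python; the 0.8 threshold and the head/tail cut-offs are assumptions to tune on your own data, and a real corpus would use a vectorized or indexed count rather than this loop.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighborhood_size(query, corpus, min_similarity=0.8):
    """Number of corpus vectors at least `min_similarity` close to the query."""
    return sum(1 for v in corpus if cosine(query, v) >= min_similarity)


def frequency_bucket(count, head_min=100, tail_max=5):
    """Label a query by how densely the corpus covers its neighborhood."""
    if count >= head_min:
        return "head"
    if count <= tail_max:
        return "tail"
    return "torso"


# Tiny illustrative corpus: three chunks on one topic, one chunk on another.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
common_query, rare_query = [1.0, 0.0], [0.0, 1.0]

print(frequency_bucket(neighborhood_size(common_query, corpus), head_min=3, tail_max=1))  # head
print(frequency_bucket(neighborhood_size(rare_query, corpus), head_min=3, tail_max=1))    # tail
```

The labels this produces are what a per-bucket recall measurement needs as input.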

3. Filter selectivity

Filtered vector search is where the standard algorithms fail most visibly. When a metadata filter eliminates 90%+ of your corpus before vector search, HNSW graph traversal hits disconnected "islands" - regions of the graph cut off by the filter. Post-filtering (apply filter after ANN search) produces incomplete result sets. Pre-filtering (pass a binary mask into ANN search) does not scale because the mask grows linearly with corpus size.
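The post-filtering shortfall is easy to reproduce without any vector library. In this sketch `ranked_ids` stands in for an ANN result list already ordered by similarity, and the filter keeps one document in ten; both the data and the function names are invented for illustration. The `pre_filter_search` function models the ideal filtered result, not how any particular database implements it.

```python
def post_filter_search(ranked_ids, allowed, k):
    """ANN first, filter second: take the top-k, then drop what the filter rejects."""
    return [i for i in ranked_ids[:k] if i in allowed]


def pre_filter_search(ranked_ids, allowed, k):
    """Filter first: rank only items that pass the filter (the ideal result)."""
    return [i for i in ranked_ids if i in allowed][:k]


ranked_ids = list(range(100))                        # IDs ordered by similarity to the query
allowed = {i for i in ranked_ids if i % 10 == 9}     # filter excludes 90% of the corpus

print(len(post_filter_search(ranked_ids, allowed, k=10)))  # 1
print(len(pre_filter_search(ranked_ids, allowed, k=10)))   # 10
```

Asking for ten results and getting one back is the "incomplete result set" failure. Over-fetching (searching for a multiple of k before filtering) narrows the gap but raises latency, and it still fails once the filter is selective enough.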

Qdrant's ACORN algorithm handles this case specifically, maintaining recall under heavy filtering by adapting traversal strategy. If your workload has highly selective filters (date ranges, tenant isolation, category filters that exclude large corpus fractions), test your recall at maximum filter selectivity, not average.

4. Recall target and latency budget

The 0.95 recall target that looks comfortable in testing can require a 2-3x latency increase to maintain at scale. Decide your recall floor before deployment, not after. For medical records similarity, legal document search, or fraud pattern matching, you want 0.98+ recall and must size ef_search accordingly. For consumer product search where approximate results are acceptable, 0.90 may be fine. The number matters because it directly determines your ef_search setting, your memory sizing, and whether IVF-PQ's quantization error is within your tolerance.

The pattern applies regardless of which database you deploy. The workload characterization happens before the database choice, not after.

Monitoring That Actually Catches This

Standard vector search monitoring misses the Index-Access Pattern Mismatch because it tracks the wrong metrics. Latency is fine. Average recall looks fine. Error rate is zero. The system appears healthy while a quarter of your queries return garbage.

The monitoring setup that catches this:

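Tiered recall measurement. Maintain a small exact-search index (FAISS IndexFlatIP over L2-normalized vectors, as in the code above; IndexFlatL2 gives the same ranking on normalized vectors) over a random 1% sample of your corpus. Route 1% of production queries to both your HNSW configuration and the exact index. The two result sets are only comparable when both indexes cover the same vectors, so either run the approximate side against a shadow HNSW index built over the same sample with production parameters, or run exact search over the full corpus for the sampled queries when you need to measure the production index itself. Compute recall@10 against exact results. Segment by query type or estimated rarity. Alert when tail recall drops below threshold.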
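For the query-routing half, hashing a stable query identifier gives a sample that is deterministic and needs no shared state between servers. A minimal sketch using only the standard library; the function name and the idea that each query carries a `query_id` string are assumptions for the example.

```python
import hashlib


def in_recall_sample(query_id: str, rate: float = 0.01) -> bool:
    """Deterministically route roughly `rate` of queries to the exact-search oracle."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# The same query ID always gets the same decision, on any host.
decision = in_recall_sample("session-42/query-7")
print(decision == in_recall_sample("session-42/query-7"))  # True
```

Hash-based sampling also means a query that is sampled once is sampled on every replay, which makes recall regressions reproducible.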

Embedding drift detection. When you upgrade your embedding model, your index is stale. The graph was built for the old embedding space. Query embeddings are now in a different distribution. Measure cosine similarity distribution between queries and returned results over time. A shift in the distribution without a corpus update is a signal of embedding drift.
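A simple version of that check compares the mean top-1 similarity in a recent window against a baseline window. This is a sketch, not a complete drift detector: the function names and the threshold of three baseline standard deviations are assumptions, and a shift can also come from a change in traffic mix, so treat a positive as a prompt to investigate.

```python
from statistics import mean, stdev


def similarity_shift(baseline_scores, current_scores):
    """Shift of the current mean top-1 similarity, in baseline standard deviations."""
    return (mean(current_scores) - mean(baseline_scores)) / stdev(baseline_scores)


def drift_suspected(baseline_scores, current_scores, threshold=3.0):
    return abs(similarity_shift(baseline_scores, current_scores)) >= threshold


# Illustrative top-1 cosine similarities from two time windows.
baseline = [0.80, 0.82, 0.84, 0.82, 0.82]
after_model_upgrade = [0.70, 0.72, 0.71, 0.69, 0.73]

print(drift_suspected(baseline, after_model_upgrade))  # True
print(drift_suspected(baseline, baseline))             # False
```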

Index freshness. For HNSW under high write load, frequent inserts degrade the graph's "small world" properties. Track time since last full reindex alongside recall metrics. If recall is degrading and the index is old relative to corpus growth rate, schedule a rebuild.

Filter recall separately. For any query that uses metadata filters, track recall independently from unfiltered queries. Filtered recall degrades differently and earlier.

The RecallMonitor class from the unified pattern above is the implementation - route 1% of production queries through it and segment by query label.

Choosing Your Vector DB: FAISS, Qdrant, Pinecone, pgvector, and Milvus Compared

The library-versus-managed-service split maps directly onto the tradeoffs above.

FAISS (Meta, open source) is a library, not a database. It gives you direct control over index type (Flat, IVF, HNSW, IVF-PQ, CAGRA via NVIDIA cuVS) and parameters. GPU acceleration via cuVS enables CAGRA index builds 12x faster than CPU HNSW. The operational cost is yours: no replication, no sharding, no query routing. The right choice when you need maximum control or are embedding vector search inside a larger system.

HNSWlib (standalone library) is the reference HNSW implementation. Lower-level than FAISS. No GPU support. Used directly when you want HNSW without FAISS's additional overhead and index type complexity.

pgvector (PostgreSQL extension, 0.7+) makes sense when you are already on PostgreSQL and your corpus stays under ~50M vectors. Parallel HNSW index builds and lower build-time memory use arrived in 0.6.0; the 0.7 series added half-precision and sparse vector types. The production ceiling is single-node PostgreSQL limits, not pgvector itself. The integration cost is close to zero if you are already on Postgres.

Qdrant delivers the best price-performance ratio in self-hosted configurations in 2026. Its ACORN algorithm handles filtered ANN better than standard HNSW implementations. Self-hosted on commodity hardware handles 10M+ vectors easily. The managed Qdrant Cloud is competitively priced relative to Pinecone.

Weaviate is the right choice for hybrid search (dense + sparse) in a managed deployment. The module system adds configuration complexity, and self-hosted Weaviate needs Kubernetes expertise. But native BM25 + vector hybrid search without maintaining a separate Elasticsearch instance is a meaningful operational simplification.

Pinecone is pure dense vector search with zero operational overhead. No filtered ANN as sophisticated as Qdrant's. Adding BM25 requires a separate ES/OpenSearch instance. The managed simplicity comes at a pricing premium - self-hosted Qdrant at $30-50/month handles equivalent workloads that Pinecone charges an order of magnitude more for.

Milvus/Zilliz supports all three index families (HNSW, IVF-PQ, DiskANN) through a unified API. The Cardinal engine (Zilliz Cloud) uses dynamic filter strategies and adapts its traversal method based on filter selectivity and data statistics. The right choice at scale when you need to switch index types without migrating data.

Chroma is a prototype tool. Its in-process mode does not support concurrent writes from multiple processes (cloud offering unproven at scale as of 2026). Use it for local development and offline processing. Do not ship it to production under write concurrency.

```mermaid
flowchart TD
    A[New vector search workload] --> B{Corpus size?}
    B -->|Over 500M vectors| C[DiskANN\nVamana + NVMe SSD\n95% recall @ 5ms\n~0.5 GB per 1M vecs RAM]
    B -->|50M to 500M| D{Memory constrained?}
    B -->|Under 50M| E{Already on PostgreSQL?}
    D -->|Yes| F[IVF-PQ\n~0.35 GB per 1M vecs\nSchedule periodic rebuilds\nRecall degrades without nprobe tuning]
    D -->|No| G{Update frequency?}
    G -->|Static or daily| F
    G -->|Realtime inserts| H[HNSW\n~4.8 GB per 1M vecs\nDiskANN if RAM is tight]
    E -->|Yes| I[pgvector HNSW\nZero infra cost\n~50M vector ceiling\nTune ef_search from day one]
    E -->|No| J{Filter selectivity?}
    J -->|Over 90% excluded| K[Qdrant ACORN\nor Milvus Cardinal\nAdaptive filter traversal\nValidate at max selectivity]
    J -->|Under 90% excluded| L[HNSW\n~4.8 GB per 1M vecs\nPinecone / Weaviate / Qdrant\nby ops preference]

    style C fill:#4A90E2,color:#fff,stroke:#2c6fad
    style F fill:#9B59B6,color:#fff,stroke:#6c3483
    style H fill:#4A90E2,color:#fff,stroke:#2c6fad
    style I fill:#6BCF7F,color:#fff,stroke:#1e8449
    style K fill:#FFD93D,color:#333,stroke:#b7950b
    style L fill:#98D8C8,color:#333,stroke:#148f77
```

Pre-Launch Checklist for Vector Search

Before any vector search system goes to production:

Sample your production query distribution. Not synthetic queries - actual or representative ones. Cluster them. Identify the tail.
Run recall@10 segmented by query type. Head vs. torso vs. tail. Alert threshold at tail recall < 0.85.
Size your index for target recall, not average recall. Set ef_search against your tail recall target, then measure latency.
Test filtered recall at maximum selectivity. If your application filters by tenant, date, or category, test recall when the filter excludes 90%+ of corpus.
Establish your Recall Horizon. Run validate_launch_recall at 10%, 50%, and 100% of target corpus size. If tail recall drops 10+ points from 10% to 100% corpus, you have crossed the Recall Horizon for your distribution and need to increase ef_search or migrate index type before launch.
Set up exact-search sampling in production. 1% of queries against an exact index over 1% corpus sample. This is your early warning system.
Plan your reindex schedule. For IVF-PQ: after every 20% corpus update. For HNSW: when graph fragmentation metrics degrade (track via recall sampling).
Validate your vector DB's filter strategy. Does it use pre-filter, post-filter, or adaptive? What happens at 99% exclusion? Test it before it matters.
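The Recall Horizon item in the checklist above reduces to a small comparison once you have tail recall measured at each corpus fraction. A minimal sketch; the function name and the example measurements are illustrative, and the 0.10 default mirrors the "10+ points" rule stated in the checklist.

```python
def crossed_recall_horizon(tail_recall_by_fraction, max_drop=0.10):
    """tail_recall_by_fraction maps corpus fraction (e.g. 0.1, 0.5, 1.0) to tail recall@k.
    Returns True when recall at the largest fraction is at least `max_drop`
    below recall at the smallest fraction."""
    smallest = min(tail_recall_by_fraction)
    largest = max(tail_recall_by_fraction)
    drop = tail_recall_by_fraction[smallest] - tail_recall_by_fraction[largest]
    return drop >= max_drop


# Made-up measurements from three validation runs.
degrading = {0.1: 0.92, 0.5: 0.88, 1.0: 0.79}
stable = {0.1: 0.92, 0.5: 0.91, 1.0: 0.90}

print(crossed_recall_horizon(degrading))  # True
print(crossed_recall_horizon(stable))     # False
```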

What to Do Now: Fixing Vector Search Recall in Your Current System

If you are early: start with pgvector if you are on PostgreSQL, Qdrant otherwise. Set ef_search higher than default from day one (64 minimum, 128 for recall-critical applications). Build recall monitoring before you build features.

If you are already in production with degrading recall: the fastest fix is increasing ef_search and measuring whether the latency hit is within your SLO. If increasing ef_search to maintain recall pushes p95 latency past your budget, you have outgrown your current index configuration and need to evaluate DiskANN for scale or IVF-PQ for memory compression.
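That decision can be made mechanical once you have swept ef_search on your own index and recorded tail recall and p95 latency at each setting. A sketch with an invented sweep table; the numbers are placeholders to show the shape of the decision, not measurements.

```python
def pick_ef_search(sweep, recall_floor, p95_budget_ms):
    """sweep: list of (ef_search, tail_recall, p95_latency_ms) measured on your index.
    Returns the smallest ef_search meeting both constraints, or None if none does."""
    viable = [row for row in sweep if row[1] >= recall_floor and row[2] <= p95_budget_ms]
    if not viable:
        return None  # no setting satisfies both: evaluate a different index type
    return min(viable, key=lambda row: row[0])[0]


# Hypothetical sweep results: (ef_search, tail recall@10, p95 latency in ms).
sweep = [(16, 0.43, 2.0), (64, 0.78, 4.5), (128, 0.88, 8.0), (256, 0.93, 17.0)]

print(pick_ef_search(sweep, recall_floor=0.85, p95_budget_ms=10.0))  # 128
print(pick_ef_search(sweep, recall_floor=0.85, p95_budget_ms=5.0))   # None
```

A `None` result is the signal described above: parameter tuning has run out of room, and the index configuration itself needs to change.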

If you have filtering problems: Qdrant's ACORN and Milvus's Cardinal are the two production-grade solutions in 2026. Do not try to fix filtered HNSW recall through parameter tuning alone - the algorithm is not designed for high-selectivity filters.

The ann-benchmarks.com leaderboard is a starting point, not a selection criterion. The algorithm that wins on glove-100 in a single-threaded CPU benchmark may not be the right algorithm for your query distribution, your hardware, your corpus size, or your update pattern. That mismatch between what the benchmark measures and what your production system needs is the Index-Access Pattern Mismatch - and the only defense against it is instrumenting for it before your users find it for you.

Your index is the second embedding model in your RAG pipeline. It decides what your LLM sees as truth. Treat it with the same scrutiny you give the embedding model itself.

References

Guo, R., Sun, P., Lindgren, E., et al. (2020). Accelerating Large-Scale Inference with Anisotropic Vector Quantization. ICML 2020. https://arxiv.org/abs/1908.10396
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3). https://arxiv.org/abs/1702.08734
Jayaram Subramanya, S., et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS 2019. https://suhasjs.github.io/files/diskann_neurips19.pdf
Aumüller, M., Bernhardsson, E., & Faithfull, A. (2019). Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87. https://ann-benchmarks.com/
arxiv:2507.00379 (2025). Towards Robustness: A Critique of Current Vector Database Assessments. https://arxiv.org/html/2507.00379v1
arxiv:2602.21600 (2026). AQR-HNSW: Accelerating ANN Search via Density-aware Quantization and Multi-stage Re-ranking. https://arxiv.org/pdf/2602.21600
Microsoft Research. DiskANN: Vector Search at Web Scale. https://www.microsoft.com/en-us/research/project/project-akupara-approximate-nearest-neighbor-search-for-large-scale-semantic-search/
Google Research. (2020). Announcing ScaNN: Efficient Vector Similarity Search. https://research.google/blog/announcing-scann-efficient-vector-similarity-search/
Milvus Blog. (2025). Vector Search in the Real World: How to Filter Efficiently Without Killing Recall. https://milvus.io/blog/how-to-filter-efficiently-without-killing-recall.md
Redis. (2026). Vector Database Challenges: What Breaks in Production. https://redis.io/blog/common-challenges-working-with-vector-databases/
Weaviate. Resource Planning: Memory Requirements. https://weaviate.io/developers/weaviate/concepts/resources
FAISS Official Documentation. https://faiss.ai/
Qdrant Benchmarks. https://qdrant.tech/benchmarks/
ANN Benchmarks Project. https://ann-benchmarks.com/
Big-ANN-Benchmarks NeurIPS23 Challenge. https://github.com/harsha-simhadri/big-ann-benchmarks
