When Vector Search Isn't Enough
Semantic search solves one problem. Hybrid retrieval solves the problem you actually have.
Everyone’s running vector search. Embed the docs, query with cosine similarity, pipe the top-k into the context window. It works. Until it doesn’t.
Here’s the failure pattern: your retrieval looks great during demos. You’re pulling semantically relevant chunks. The model gives coherent answers. Then you go to production and someone asks “what’s the current price for SKU-44821” or “list every contract signed in Q3 2024” and the whole thing falls apart.
Vector search is optimized for meaning. Not keywords. Not exact values. Not structured lookups. When your user wants something specific — a product code, a policy number, a date range — semantic similarity is the wrong tool. You need exact match. You need BM25. You need filters.
That’s what hybrid retrieval actually means: combining dense vector search with sparse keyword search and structured metadata filtering, then merging the results intelligently. Not as a fallback. As the default architecture.
The three retrieval modes worth building:
Dense retrieval (vectors). Handles conceptual questions, paraphrasing, natural language queries where the user doesn’t know the exact terminology. This is what most people implement and stop at.
Sparse retrieval (BM25/keyword). Handles precise term matching. If the user types a model number, a legal clause, or a company name, keyword search finds it. Vectors bury it under similar-sounding noise.
Metadata filtering. Every chunk should carry structured attributes — date, source, document type, author, category. Filter before you retrieve. Don’t make the reranker sort through irrelevant docs because you didn’t scope the search.
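The sparse and filtering halves of this can be sketched in a few lines. A minimal, self-contained version follows — the corpus, field names, and BM25 parameters are illustrative, and dense retrieval is omitted since it needs an embedding model:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Okapi BM25 over pre-tokenized docs (lists of tokens); higher = more relevant."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Illustrative corpus: every chunk carries structured metadata alongside its text.
corpus = [
    {"text": "SKU-44821 retails at $149 effective January 2025", "doc_type": "pricing", "year": 2025},
    {"text": "Our pricing philosophy balances value and margin", "doc_type": "blog", "year": 2024},
    {"text": "SKU-44821 was discontinued in the EU market", "doc_type": "memo", "year": 2023},
]

def retrieve(query, corpus, **filters):
    # Filter first: scope the candidate set by metadata before any scoring.
    candidates = [c for c in corpus if all(c.get(k) == v for k, v in filters.items())]
    if not candidates:
        return []
    # Then sparse scoring over the survivors; exact terms like "SKU-44821" match directly.
    scores = bm25_scores(query.lower().split(), [c["text"].lower().split() for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, s in ranked if s > 0]
```

In a full hybrid setup the same filtered candidates would also go through a vector index, and both rankings would feed the fusion step.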
The real work is in fusion. Reciprocal rank fusion (RRF) is the pragmatic default — combine rankings from multiple retrieval strategies without needing calibrated scores. It’s not perfect, but it’s robust and doesn’t require you to tune weights for every query type.
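RRF fits in a dozen lines. A minimal sketch, where k = 60 is the conventional damping constant and the doc IDs are placeholders:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of doc IDs into one ordering.
    Each list contributes 1/(k + rank) per doc -- ranks only, no score calibration."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# A doc ranked decently by both strategies beats one ranked first by only one:
dense_hits = ["d3", "d1", "d7"]
sparse_hits = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # → ['d1', 'd3', 'd9', 'd7']
```

Note that "d1" wins overall despite topping only the sparse list — consistent mid-pack presence outweighs a single first-place showing, which is exactly the robustness you want from fusion.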
After fusion, a cross-encoder reranker on your top-20 results earns its compute budget. It looks at the full query-chunk pair, not just embeddings, and reorders accordingly. That’s where you recover the precision that retrieval traded away for recall.
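The reranking step can be sketched independently of any particular model. Here `score_fn` stands in for a real cross-encoder (with the sentence-transformers library, for instance, that would be `CrossEncoder(...).predict` over (query, chunk) pairs); the token-overlap scorer below is a crude stand-in used only to keep the example runnable:

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Re-order fused candidates by scoring each full (query, chunk) pair.
    In production score_fn would be a cross-encoder model; here it is any
    callable (query, chunk) -> float."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

# Stand-in scorer for illustration only -- a real cross-encoder reads both texts jointly.
def token_overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "Q3 2024 contract register",
    "holiday schedule",
    "list of contracts signed in Q3 2024",
]
print(rerank("contracts signed in Q3 2024", candidates, token_overlap, top_n=2))
```

The design point survives the stub: reranking sees the query and the chunk together, so it can make distinctions that independent embeddings cannot.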
The teams that ship reliable RAG products aren’t using fancier models. They’re doing more careful retrieval engineering. The context window is sacred real estate — what you put in it determines everything that comes out.
Build the retrieval layer like you mean it, not like an afterthought between chunking and prompting.