Retrieval Infrastructure

Building Context for LLMs: The Retrieval Problem at Scale

Yuki Nakashima August 24, 2023

The single most important architectural decision in an LLM application is how you fill the context window. A model's output quality is bounded by the quality of what it can see, and what it can see is bounded by the retrieval architecture beneath it. Retrieval-augmented generation is less a single technique than an infrastructure problem, one with a wide solution space and significant performance implications.

The retrieval problem has four components that interact in non-obvious ways: document chunking, embedding quality, index performance, and reranking. Most tutorials present these as a pipeline with obvious defaults. In production at scale, each component has failure modes that compound with the others in ways that are difficult to diagnose without systematic instrumentation.

The chunking problem

Documents don't naturally decompose into retrieval units. The right chunk size depends on the embedding model's optimal input length, the granularity of the retrieval queries, the coherence of the document structure, and the downstream model's context window size. Fixed-size chunking with 512 tokens is a reasonable default for experimentation. It's rarely the right answer for production systems where the source documents have meaningful structure — code repositories, legal documents, knowledge bases, research papers — that informs how they should be retrieved.
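As a baseline for comparison, fixed-size chunking can be sketched in a few lines. This is a minimal illustration, not a recommended implementation: the function name chunk_fixed, the 64-token overlap, and the use of a pre-tokenized list are all assumptions made for the example. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
from typing import List


def chunk_fixed(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens (an assumption of this sketch,
    not something the default of 512 tokens implies on its own).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")

    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no trailing chunk is
        # produced that is fully contained in the previous one.
        if start + size >= len(tokens):
            break
    return chunks
```

With 1,200 tokens and the defaults above, this yields three chunks of 512, 512, and 304 tokens. Nothing in the function looks at document structure, which is exactly the limitation for structured sources.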

Semantic chunking — splitting documents at semantic boundaries rather than token boundaries — tends to improve retrieval quality significantly for structured document types. Hierarchical chunking, where documents are indexed at multiple granularities simultaneously, improves performance on queries that range from specific to general. These are not exotic techniques, but they require more sophisticated pipeline infrastructure than a simple character splitter.
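A rough sketch of hierarchical chunking, under simplifying assumptions: sections are assumed to start with a line beginning "# " and paragraphs are assumed to be separated by blank lines. Both conventions, along with the Chunk type and the ID scheme, are hypothetical and chosen only for illustration. Real documents need a parser that matches their actual structure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    chunk_id: str
    level: str                      # "section" or "paragraph"
    text: str
    parent_id: Optional[str] = None


def hierarchical_chunks(doc_id: str, text: str) -> List[Chunk]:
    """Index a document at two granularities: whole sections and paragraphs."""
    # Split into sections at heading lines (assumed convention: "# Title").
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("# ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    sections = [s for s in sections if s.strip()]

    chunks: List[Chunk] = []
    for s_idx, section in enumerate(sections):
        section_id = f"{doc_id}/s{s_idx}"
        chunks.append(Chunk(section_id, "section", section.strip()))

        # Drop the heading line before splitting the body into paragraphs.
        lines = section.splitlines()
        body = "\n".join(lines[1:]) if lines[0].startswith("# ") else section
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        for p_idx, para in enumerate(paragraphs):
            chunks.append(
                Chunk(f"{section_id}/p{p_idx}", "paragraph", para, parent_id=section_id)
            )
    return chunks
```

Each paragraph chunk keeps a pointer to its parent section, so a retrieval hit on a specific paragraph can be expanded to the surrounding section when the query is more general.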

Embedding model selection is underappreciated

The dominant embedding models for text retrieval, with OpenAI's text-embedding-ada-002 the most widely used, are trained on general text and optimized for general semantic similarity. For domain-specific retrieval tasks, they often underperform fine-tuned or domain-specific alternatives by meaningful margins. The gap is most pronounced in technical domains such as code, scientific literature, and legal text, where the similarity that matters for retrieval is structured by domain conventions that general-purpose models don't encode well.

Most teams don't evaluate this empirically before deploying. They use the default model, observe mediocre retrieval quality, and attribute it to retrieval architecture rather than embedding quality. Comparing candidate embedding models on a labeled sample of representative queries is basic practice, but surprisingly uncommon.
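One way such an evaluation might look is a recall@k comparison over a small labeled query set. The sketch below is model-agnostic: embed is a placeholder for whatever function wraps a candidate model, and the names recall_at_k, corpus, and labeled_queries are hypothetical. Brute-force cosine similarity stands in for a real index, which is fine for an evaluation set of a few thousand documents.

```python
import math
from typing import Callable, Dict, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    corpus: Dict[str, str],                 # doc_id -> text
    labeled_queries: Dict[str, Set[str]],   # query -> relevant doc_ids
    k: int = 5,
) -> float:
    """Mean fraction of each query's relevant documents found in the top k."""
    if not labeled_queries or any(not rel for rel in labeled_queries.values()):
        raise ValueError("every query needs at least one relevant document")

    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    total = 0.0
    for query, relevant in labeled_queries.items():
        q_vec = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
        total += len(relevant & set(ranked[:k])) / len(relevant)
    return total / len(labeled_queries)
```

Running the same corpus and labeled queries through each candidate's embed function gives a directly comparable number per model, which separates embedding problems from problems elsewhere in the retrieval architecture.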

Where vector databases fit

Vector databases have absorbed a lot of the conversation about the retrieval infrastructure stack, for good reason: they're the component that's most obviously missing from a standard relational database setup. But vector search quality is largely an embedding quality and indexing strategy problem, not a database choice problem. Most major vector databases offer comparable raw approximate nearest neighbor (ANN) search performance. The differentiation is in metadata filtering performance, hybrid search (combining vector and keyword retrieval), update performance, and operational manageability.
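To make "hybrid search" concrete, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). This is offered as an illustration of the idea, not as the method any particular database uses. The function name and the constant k = 60 (a conventional default) are assumptions of the sketch.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]    # hypothetical vector-search results
keyword_hits = ["b", "d", "a"]   # hypothetical keyword-search results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["b", "a", "d", "c"]
```

RRF uses only rank positions, so it avoids having to normalize vector similarities and keyword scores onto a common scale. Documents that appear in both lists rise to the top.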

At Flintrock, we've been watching both Qdrant and Weaviate solve different parts of this problem in ways that reflect different architectural philosophies. Qdrant's filterable HNSW approach treats metadata filtering as an index-level concern. Weaviate's object-native model treats the vector and its metadata as first-class together, which simplifies application logic at the cost of a less general query model. Both are correct for certain workloads. The right choice is workload-dependent, not universal.
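Why filtering placement matters can be shown with a toy comparison of filtering after the top-k search versus filtering before it. The sketch below uses brute-force sorting over precomputed similarity scores and is not how Qdrant or Weaviate implements anything. The Item layout and both function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (doc_id, similarity to the query, metadata); scores are assumed precomputed.
Item = Tuple[str, float, Dict[str, str]]
Predicate = Callable[[Dict[str, str]], bool]


def post_filter(items: List[Item], predicate: Predicate, k: int) -> List[str]:
    """Take the top k by similarity, then drop items that fail the filter."""
    top = sorted(items, key=lambda it: it[1], reverse=True)[:k]
    return [doc_id for doc_id, _, meta in top if predicate(meta)]


def pre_filter(items: List[Item], predicate: Predicate, k: int) -> List[str]:
    """Apply the filter first, then take the top k of what remains."""
    allowed = [it for it in items if predicate(it[2])]
    top = sorted(allowed, key=lambda it: it[1], reverse=True)[:k]
    return [doc_id for doc_id, _, _ in top]


items: List[Item] = [
    ("a", 0.9, {"lang": "en"}),
    ("b", 0.8, {"lang": "de"}),
    ("c", 0.7, {"lang": "de"}),
    ("d", 0.6, {"lang": "en"}),
    ("e", 0.5, {"lang": "en"}),
]
is_english = lambda meta: meta["lang"] == "en"

after = post_filter(items, is_english, k=2)   # ["a"]: one slot lost to a filtered item
before = pre_filter(items, is_english, k=2)   # ["a", "d"]: a full result set
```

Post-filtering can return fewer than k results when the filter is selective, while naive pre-filtering requires scanning the filtered subset. Handling the filter inside the index traversal is one way to avoid both costs, which is the sense in which filtering becomes an index-level concern.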

Reranking changes the economics

Two-stage retrieval (retrieve a large candidate set from the vector index, then rerank it with a more expensive cross-encoder model) substantially improves retrieval quality at modest latency cost. This is increasingly standard practice in production RAG systems. The economic implication is that the first-stage index has to return far more candidates per query (top-100 or top-200 rather than top-5), which changes the performance profile required from the vector database. Infrastructure designed around top-5 retrieval will underperform at top-100, sometimes significantly. Most teams miss this when initially sizing their retrieval infrastructure.
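The two-stage pattern can be sketched independently of any particular index or reranker. In the example below, cheap_score stands in for first-stage vector similarity and expensive_score for a cross-encoder. Both are hypothetical callables supplied by the caller, and the function name and default sizes are assumptions for illustration.

```python
import heapq
from typing import Callable, Iterable, List

Scorer = Callable[[str, str], float]   # (query, doc_id) -> relevance score


def two_stage_retrieve(
    query: str,
    doc_ids: Iterable[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    candidates: int = 100,
    final_k: int = 5,
) -> List[str]:
    """Shortlist with a cheap scorer, then rerank the shortlist with a costly one."""
    # Stage 1: wide, cheap recall. In production this is the vector index query.
    shortlist = heapq.nlargest(candidates, doc_ids, key=lambda d: cheap_score(query, d))
    # Stage 2: the expensive model only ever sees `candidates` documents.
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]


# Toy scores: the cheap scorer ranks "c" third, the expensive scorer ranks it above "a" and "b".
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
costly = {"a": 0.2, "b": 0.9, "c": 0.95, "d": 1.0}

wide = two_stage_retrieve("q", cheap, lambda q, d: cheap[d], lambda q, d: costly[d],
                          candidates=3, final_k=2)    # ["c", "b"]
narrow = two_stage_retrieve("q", cheap, lambda q, d: cheap[d], lambda q, d: costly[d],
                            candidates=2, final_k=2)  # ["b", "a"]
```

The narrow run never shows "c" to the reranker, so the reranker cannot recover it. The quality gain from stage two is capped by how many candidates stage one returns, which is why first-stage throughput at top-100 or top-200 becomes the sizing constraint.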