RAG (Retrieval-Augmented Generation) in 2026 is a four-stage pipeline: chunk your documents into 500-1500 token pieces, embed each chunk with a model like Voyage-3 or OpenAI's text-embedding-3-large, store the embeddings in a vector database (Pinecone, Qdrant, MySQL 9's VECTOR type, Postgres pgvector), then at query time, embed the user question, run a cosine-similarity search, optionally rerank, and pass the top-K chunks to the LLM. The whole thing is well-understood now — but 2026's actual question is "should I use RAG or just stuff the relevant context into a 1M-token Claude prompt?" I'll cover the pipeline, the chunking and embedding choices, and the long-context-vs-RAG decision.
The reason this matters: anytime your LLM needs to answer questions from a knowledge base that's bigger than the context window, RAG is the canonical answer. With 1M-token context now standard, that threshold has moved — for some knowledge bases (a single product manual, a contract, a codebase), long context replaces RAG. For others (a CRM with 10 million records, a corpus of legal documents, an entire website), RAG is still the only viable architecture.
Jump to:
- The four stages of RAG
- Chunking: size, overlap, and boundaries
- Embedding models in 2026
- Vector databases compared
- Hybrid search: BM25 + vector
- Reranking: the optional 5th stage
- RAG vs long context: when to skip RAG entirely
- FAQ
The four stages of RAG
Stage 1: Chunk. Split each document into pieces small enough to embed (300-2000 tokens each, depending on the embedding model's context window).
Stage 2: Embed. Generate a vector for each chunk using an embedding model.
Stage 3: Index. Store the vectors in a database that supports approximate-nearest-neighbour (ANN) search.
Stage 4: Retrieve. Embed the user query, run ANN search to find the top K most similar chunks, pass them to the LLM with the user question.
Optional Stage 5 (covered below): rerank the top K to a more focused top N before passing to the LLM.
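To make the stages concrete, here is a minimal in-memory sketch. It assumes a toy bag-of-words `toy_embed` function in place of a real embedding model and an exact linear scan in place of ANN search; every function name is illustrative rather than taken from any library.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    # Stages 2 and 3: embed each chunk and keep (chunk, vector) pairs in memory.
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(index, question, k=2):
    # Stage 4: embed the query and rank chunks by cosine similarity (exact, not ANN).
    q = toy_embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, chunks):
    # The retrieved chunks become the context handed to the LLM.
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

# Stage 1 is assumed done: these strings are already chunks.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available on weekdays from 9 to 5.",
]
index = build_index(chunks)
top = retrieve(index, "What is the API rate limit?", k=1)
prompt = build_prompt("What is the API rate limit?", top)
```

In a real system `toy_embed` would be a call to a hosted or local embedding model, and `build_index` and `retrieve` would be backed by a vector database.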
Chunking: size, overlap, and boundaries
Chunk size trade-off:
- Smaller chunks (300-500 tokens): better precision (the relevant chunk is more focused) but more chunks total, more API calls to embed, more vectors to index. Better for fact-lookup queries.
- Larger chunks (1000-2000 tokens): better context preservation (more surrounding info per chunk) but coarser matching. Better for "explain X" queries that need broader context.
The middle of the range (500-1000 tokens) works for most use cases. Tune based on your eval results.
Overlap: 10-20% between adjacent chunks. The overlap ensures that a fact split across a chunk boundary still appears intact in at least one of the chunks. Without overlap, you sometimes get the "sentence ends with 'the patient' and the next chunk starts with 'is allergic to penicillin'" failure mode.
Boundaries: split on semantic boundaries (paragraphs, sections, sentences) rather than mid-word or mid-sentence. Most chunking libraries (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's chunkers) handle this; if you're rolling your own, prefer \n\n split first, fall back to . split, fall back to a hard character limit.
For code, split on function or class boundaries. For Markdown, split on headings. For HTML, split on <section> or <article> boundaries (similar to the HTML tag matching patterns).
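If you do roll your own, the fallback order described above might look like the sketch below. It makes two simplifying assumptions: whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), and the overlap is taken as the trailing words of the previous chunk rather than whole sentences. The function names are hypothetical.

```python
import re

def split_units(text, max_tokens):
    """Break text into units of at most max_tokens words:
    paragraphs first, then sentences, then a hard cut."""
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para.split()) <= max_tokens:
            units.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = sent.split()
            # Hard limit: slice oversized sentences into fixed-size word windows.
            for i in range(0, len(words), max_tokens):
                units.append(" ".join(words[i:i + max_tokens]))
    return units

def chunk_text(text, max_tokens=500, overlap=0.15):
    """Greedily pack units into chunks of at most max_tokens words,
    repeating the tail of each chunk at the start of the next."""
    overlap_words = int(max_tokens * overlap)
    # Units are capped so that carried-over overlap plus one unit still fits.
    unit_limit = max_tokens - overlap_words
    chunks, current = [], []
    for unit in split_units(text, unit_limit):
        words = unit.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_words:] if overlap_words else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole paragraphs and sentences keeps most boundaries semantic; the hard cut only triggers for a single sentence longer than the unit limit.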
Embedding models in 2026
The major players:
- OpenAI text-embedding-3-large: 3072 dimensions, $0.13 per million tokens. Strong general-purpose.
- Voyage-3: 1024 dimensions, $0.06 per million tokens. Best quality on independent benchmarks for English RAG; particularly strong on code and technical content.
- Cohere embed-english-v3: 1024 dimensions, $0.10 per million tokens. Good with their reranker for the integrated pipeline.
- Gemini text-embedding-004: 768 dimensions, free up to limits, $0.025 per million paid.
- Local options: BGE-large, E5-large, sentence-transformers/all-mpnet-base-v2 — run on your own GPU, zero per-token cost, lower quality than the hosted leaders.
Pick on benchmark performance for your domain (try 2-3 on a representative sample), not on marketing. The gap between the top three is small enough that any of them works; the gap to local models is larger but their zero per-token cost matters at scale.
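To see how per-token pricing plays out at scale, a rough cost estimate helps. The sketch below uses the prices listed above (verify them against current pricing pages) and assumes that a fractional overlap inflates the embedded token count by about 1 / (1 - overlap). The function name and the 100M-token corpus are illustrative.

```python
# USD per million tokens, taken from the list above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
    "embed-english-v3": 0.10,
    "text-embedding-004": 0.025,
}

def embedding_cost_usd(corpus_tokens, overlap, price_per_million):
    """Approximate one-off cost of embedding a corpus.

    With a fractional overlap between adjacent chunks, each chunk advances
    by (1 - overlap) of its length, so roughly corpus_tokens / (1 - overlap)
    tokens get embedded in total.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    embedded_tokens = corpus_tokens / (1 - overlap)
    return embedded_tokens / 1_000_000 * price_per_million

# Hypothetical corpus: 100M tokens, 15% overlap.
for model, price in PRICE_PER_MILLION_TOKENS.items():
    cost = embedding_cost_usd(100_000_000, overlap=0.15, price_per_million=price)
    print(f"{model}: ${cost:,.2f}")
```

This covers the initial indexing pass only; re-embedding after a model change and per-query embedding calls come on top.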
Vector databases compared
| Database | Hosted | Self-host | Best for |
|---|---|---|---|
| Pinecone | Yes (managed) | No | Production scale, simplest setup |
| Qdrant | Both | Yes | Open-source preference, self-host control |
| Weaviate | Both | Yes | Hybrid search out of the box |
| MySQL 9.0+ VECTOR type | Self-host or managed MySQL | Yes | Already on MySQL, want one fewer service |
| Postgres pgvector | Yes (Neon, Supabase) | Yes | Already on Postgres, want one fewer service |
| Chroma | Local mostly | Yes | Prototyping, embedded use |
| MongoDB Atlas Vector Search | Yes | No | Already on MongoDB |
The 2024-2026 trend has been "do RAG in your existing database": pgvector for Postgres shops, the new VECTOR type for MySQL 9 shops. This avoids a separate service and keeps your data and embeddings in the same transactional store. For MySQL specifically, How to Add Semantic Search to a MySQL App walks through the full integration.
For very high scale (100M+ vectors, single-digit-ms latency), specialized vector databases (Pinecone, Qdrant) still win. Below that scale, in-database vectors are usually fine.
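A quick size estimate shows why that scale threshold matters. The sketch below assumes vectors are stored as float32 (4 bytes per value) and counts only the raw vectors, not index structures, metadata, or replicas; the function name is illustrative.

```python
def raw_vector_storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, in decimal gigabytes.

    Assumes float32 (4 bytes per value) by default and ignores index
    structures, metadata, and replication, all of which add to the total.
    """
    return num_vectors * dimensions * bytes_per_value / 1e9

# 100M vectors at the dimensionalities of the hosted models listed earlier.
for dims in (768, 1024, 3072):
    print(dims, raw_vector_storage_gb(100_000_000, dims))
```

At 100M vectors and 1024 dimensions the raw vectors alone come to roughly 410 GB, which is where a store built specifically for vector search starts to pay off.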
Hybrid search: BM25 + vector
Pure semantic (vector) search misses queries where the exact keyword matters: product names, error codes, specific function names. Pure BM25 (keyword) search misses queries phrased differently from the source.
Hybrid search runs both and merges results. The standard merge is RRF (Reciprocal Rank Fusion):
```python
def rrf(rankings_by_method, k=60):
    # Each ranking is a list of doc IDs, best first.
    scores = {}
    for ranking in rankings_by_method:
        for i, doc_id in enumerate(ranking):
            # Rank is i + 1; k damps the advantage of the very top positions.
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return sorted(scores, key=scores.get, reverse=True)
```
The vector search returns its top 20 by cosine similarity; the BM25 search returns its top 20 by keyword score; RRF merges them and you take the top 10 of the merged result.
For most RAG implementations, hybrid search delivers 10-20% higher recall than vector-only. Worth the complexity for any production system.
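Whether you see a gain of that size depends on your corpus, so it is worth measuring. A minimal recall@k helper, assuming you have a small labeled set of queries with known relevant chunk IDs (the IDs below are made up for illustration):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical results for one query whose relevant chunks are c1 and c4.
relevant = ["c1", "c4"]
vector_only = ["c1", "c7", "c3"]
hybrid = ["c1", "c4", "c7"]
vector_recall = recall_at_k(vector_only, relevant, k=3)
hybrid_recall = recall_at_k(hybrid, relevant, k=3)
```

Run both retrieval modes over the same labeled queries and compare the means before committing to the extra moving parts.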
Reranking: the optional 5th stage
After retrieving top K (say 20-50 chunks), pass them through a reranker that scores each chunk for relevance to the specific query. Take the top N (say 5-10) of the reranked results to pass to the LLM.
The leaders: Cohere Rerank-3, Voyage Rerank-2, BGE-Reranker (open-source). All are cross-encoders that take the query + chunk pair and output a score — slower than vector search per chunk but much more accurate.
The trade-off: a small latency cost (reranking 20 chunks adds roughly 200ms) for a meaningful boost in precision. If your evals show a chunk-precision problem, reranking is the first place to look.
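Structurally, the rerank stage is just "score every (query, chunk) pair, sort, truncate". The sketch below assumes a pluggable `score_fn`; the `keyword_overlap` scorer is a toy placeholder so the example runs on its own, and in practice it would be replaced by a call to a cross-encoder reranker.

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Score each retrieved chunk against the query and keep the best top_n."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    # sort() is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def keyword_overlap(query, chunk):
    """Toy scorer: fraction of query words present in the chunk.
    A placeholder for a real cross-encoder, not a substitute for one."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

candidates = [
    "our office is in Berlin",
    "error code E42 means disk full",
    "disk usage can be checked with df",
]
best = rerank("what does error code E42 mean", candidates, keyword_overlap, top_n=1)
```

Keeping the scorer as a parameter makes it easy to compare rerankers against the same retrieved candidates in your evals.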
RAG vs long context: when to skip RAG entirely
In 2026 you can fit a 1M-token corpus into a Claude or Gemini context window. For many knowledge bases — a single product manual, a contract, a codebase — that's enough that you don't need RAG at all.
The decision rule:
- Corpus fits comfortably in context (under ~500K tokens) and is read often: long context is simpler, more accurate, and with prompt caching the cost is acceptable.
- Corpus is too big for context, OR queries are infrequent, OR you have many tenants each with their own corpus: RAG is the right architecture.
- Hybrid: use RAG to retrieve the right document, then load that one document fully into context.
Long context isn't a magic bullet — models still attend to early and late content more than the middle ("lost in the middle" problem), and quality on very long contexts varies by model. But for many real workloads, the simplicity of "stuff the whole manual into the prompt" is a win.
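The decision rule can be written down as a small function. This is a sketch under stated assumptions: `comfortable_limit` mirrors the ~500K-token figure above, while `frequent_queries` is a hypothetical cutoff for "read often" that you would tune against your own prompt-caching costs.

```python
def choose_architecture(corpus_tokens, queries_per_day, tenants=1,
                        comfortable_limit=500_000, frequent_queries=100):
    """Return "long-context" or "rag" following the decision rule above.

    comfortable_limit mirrors the ~500K-token figure in the text.
    frequent_queries is a hypothetical cutoff for "read often".
    """
    fits = corpus_tokens <= comfortable_limit
    read_often = queries_per_day >= frequent_queries
    if fits and read_often and tenants == 1:
        return "long-context"
    # Too big, rarely queried, or multi-tenant: retrieve instead.
    # The hybrid variant (retrieve one document, load it in full) still starts with RAG.
    return "rag"
```

For example, a 200K-token product manual queried a thousand times a day lands on long context, while the same manual queried twice a day, or split across fifty tenants, lands on RAG.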
What to do next
For the application-side patterns:
- How to Add Semantic Search to a MySQL App: practical walk-through of the MySQL 9 VECTOR type as the storage layer.
- How to Stop an LLM from Hallucinating: RAG is the biggest anti-hallucination technique; this covers the grounding patterns that go with it.
- How to Build an LLM Agent with Tool Use: agentic RAG (the model decides when to search) is the natural next step.
For cost optimisation of the retrieved-context part of the prompt:
- How to Cut LLM API Costs with Prompt Caching — cache the system prompt and tool definitions; the per-query retrieved context is dynamic.
External references: Anthropic on RAG and context engineering, OpenAI embeddings guide, Voyage AI embedding models.
FAQ
See also
- Elasticsearch Cheat Sheet: the dense_vector mapping, knn query, and semantic_text field type used as the vector store in production RAG pipelines
- How to Add Semantic Search to a MySQL App: the lighter-weight alternative when the corpus is small enough to live in your application database
- How to Choose Between Claude Haiku, Sonnet, and Opus: picking the right generator model for the answer-generation step of RAG
- How to Get Reliable JSON from an LLM: when the generation step needs structured output rather than prose





