How to Build RAG with Embeddings and Vector Search (2026)

RAG (Retrieval-Augmented Generation) in 2026 is a four-stage pipeline: chunk your documents into 500-1500 token pieces, embed each chunk with a model like Voyage-3 or OpenAI's text-embedding-3-large, store the embeddings in a vector database (Pinecone, Qdrant, MySQL 9's VECTOR type, Postgres pgvector), then at query time embed the user question, run a cosine-similarity search, optionally rerank, and pass the top-K chunks to the LLM. The pipeline itself is well understood now. The real question in 2026 is "should I use RAG or just stuff the relevant context into a 1M-token Claude prompt?" I'll cover the pipeline, the chunking and embedding choices, and the long-context-vs-RAG decision.

The reason this matters: any time your LLM needs to answer questions from a knowledge base that's bigger than the context window, RAG is the canonical answer. With 1M-token context windows now available, that threshold has moved: for some knowledge bases (a single product manual, a contract, a codebase), long context replaces RAG. For others (a CRM with 10 million records, a corpus of legal documents, an entire website), RAG is still the only viable architecture.

Jump to:

The four stages of RAG
Chunking: size, overlap, and boundaries
Embedding models in 2026
Vector databases compared
Hybrid search: BM25 + vector
Reranking: the optional 5th stage
RAG vs long context: when to skip RAG entirely
FAQ

The four stages of RAG

Stage 1: Chunk. Split each document into pieces small enough to embed (300-2000 tokens each, depending on the embedding model's context window).

Stage 2: Embed. Generate a vector for each chunk using an embedding model.

Stage 3: Index. Store the vectors in a database that supports approximate-nearest-neighbour (ANN) search.

Stage 4: Retrieve. Embed the user query, run ANN search to find the top K most similar chunks, pass them to the LLM with the user question.

Optional Stage 5 (covered below): rerank the top K to a more focused top N before passing to the LLM.
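
To make Stage 4 concrete, here's a minimal sketch of the retrieval step in pure Python. It assumes the chunks have already been embedded; the `index` dict, the chunk IDs, and the 3-dimensional vectors are made up for illustration, and the brute-force scan stands in for the ANN search a real vector database would run.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to query_vec.

    `index` is a hypothetical in-memory dict of chunk_id -> vector. This is an
    exact brute-force scan, not approximate nearest-neighbour search.
    """
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real embeddings.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
results = top_k([0.8, 0.2, 0.0], index, k=2)
print(results)  # refund-policy ranks first, then shipping-times
```

The text of the returned chunks is what gets pasted into the prompt alongside the user question.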

Chunking: size, overlap, and boundaries

Chunk size trade-off:

Smaller chunks (300-500 tokens): better precision (the relevant chunk is more focused) but more chunks total, more API calls to embed, more vectors to index. Better for fact-lookup queries.
Larger chunks (1000-2000 tokens): better context preservation (more surrounding info per chunk) but coarser matching. Better for "explain X" queries that need broader context.

The middle of the range (500-1000 tokens) works for most use cases. Tune based on your eval results.

Overlap: 10-20% between adjacent chunks. The overlap ensures that a fact split across a chunk boundary still appears intact in at least one of the chunks. Without overlap, you sometimes get the "sentence ends with 'the patient' and the next chunk starts with 'is allergic to penicillin'" failure mode.

Boundaries: split on semantic boundaries (paragraphs, sections, sentences) rather than mid-word or mid-sentence. Most chunking libraries (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's chunkers) handle this; if you're rolling your own, split on blank lines (\n\n) first, fall back to sentence ends (a period followed by a space), then fall back to a hard character limit.

For code, split on function or class boundaries. For Markdown, split on headings. For HTML, split on <section> or <article> boundaries.
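
If you do roll your own, the size, overlap, and boundary rules above might be sketched like this. It's an illustration under two assumptions: tokens are approximated by whitespace-separated words (swap in your embedding model's tokenizer for real counts), and the function name and defaults are mine, not any library's API.

```python
import re

def chunk_text(text, max_tokens=800, overlap_ratio=0.15):
    """Split text into overlapping chunks, preferring paragraph, then sentence boundaries.

    Tokens are approximated by whitespace-separated words, and whitespace
    inside a chunk is normalised to single spaces.
    """
    overlap = int(max_tokens * overlap_ratio)
    budget = max_tokens - overlap  # largest unit that still fits after the carried-over overlap

    # 1. Build units: whole paragraphs where they fit, else sentences, else hard cuts.
    units = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        if len(words) <= budget:
            units.append(words)
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            sentence_words = sentence.split()
            for start in range(0, len(sentence_words), budget):
                units.append(sentence_words[start:start + budget])

    # 2. Pack units greedily; each new chunk starts with the tail of the previous one.
    chunks, current = [], []
    for unit in units:
        if not unit:
            continue
        if current and len(current) + len(unit) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because every unit is capped at `max_tokens` minus the overlap, no chunk exceeds `max_tokens` even after the overlap is carried over.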

Embedding models in 2026

The major players:

OpenAI text-embedding-3-large: 3072 dimensions, $0.13 per million tokens. Strong general-purpose.
Voyage-3: 1024 dimensions, $0.06 per million tokens. Best quality on independent benchmarks for English RAG; particularly strong on code and technical content.
Cohere embed-english-v3: 1024 dimensions, $0.10 per million tokens. Good with their reranker for the integrated pipeline.
Gemini gemini-embedding-001: 3072 dimensions by default (truncatable to 1536 or 768 via Matryoshka representation), $0.15 per million tokens. This is the GA replacement for text-embedding-004, which Google retired on 2026-01-14.
Local options: BGE-large, E5-large, sentence-transformers/all-mpnet-base-v2 — run on your own GPU, zero per-token cost, lower quality than the hosted leaders.
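
For a sense of what those per-token prices mean in practice, here's a back-of-the-envelope calculation. The prices are the ones listed above (check current vendor pricing before budgeting); the corpus size is a made-up example, and the lowercase model keys are just labels for this sketch.

```python
# USD per million tokens, copied from the list above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
    "embed-english-v3": 0.10,
    "gemini-embedding-001": 0.15,
}

def embedding_cost_usd(model, total_tokens):
    """One-off cost of embedding total_tokens with the given model."""
    return PRICE_PER_MILLION_TOKENS[model] * total_tokens / 1_000_000

# Hypothetical corpus: 200,000 chunks of 800 tokens each = 160M tokens.
corpus_tokens = 200_000 * 800
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${embedding_cost_usd(model, corpus_tokens):.2f}")
```

At these prices the whole hypothetical corpus embeds for between roughly $10 and $24. Chunk overlap inflates the token count by about the overlap percentage, and re-embedding after a model change costs the same again.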

Pick on benchmark performance for your domain (try 2-3 on a representative sample), not on marketing. The gap between the top three is small enough that any of them works; the gap to local models is larger but their zero per-token cost matters at scale.
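
One way to run that comparison is recall@k over a small hand-labelled question set. The sketch below assumes you have already run retrieval with each candidate model and recorded, per question, the chunk IDs it returned and the chunk IDs a human marked as relevant; all IDs and results shown are invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per test question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Hypothetical retrieval results for the same two questions under two models.
model_a_runs = [(["c1", "c7", "c3"], ["c1", "c3"]), (["c9", "c2", "c4"], ["c4"])]
model_b_runs = [(["c7", "c8", "c1"], ["c1", "c3"]), (["c2", "c9", "c5"], ["c4"])]

print("model A recall@3:", mean_recall_at_k(model_a_runs, k=3))  # 1.0
print("model B recall@3:", mean_recall_at_k(model_b_runs, k=3))  # 0.25
```

Two questions is only enough to show the mechanics; a few dozen representative questions gives a more trustworthy signal.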

Vector databases compared

Database	Managed offering	Self-hostable	Best for
Pinecone	Yes	No	Production scale, simplest setup
Qdrant	Yes	Yes	Open-source preference, self-host control
Weaviate	Yes	Yes	Hybrid search out of the box
MySQL 9.0+ VECTOR type	Yes (managed MySQL)	Yes	Already on MySQL, want one fewer service
Postgres pgvector	Yes (Neon, Supabase)	Yes	Already on Postgres, want one fewer service
Chroma	Mostly local	Yes	Prototyping, embedded use
MongoDB Atlas Vector Search	Yes	No	Already on MongoDB

The 2024-2026 trend has been "do RAG in your existing database": pgvector for Postgres shops, the new VECTOR type for MySQL 9 shops. This avoids a separate service and keeps your data and embeddings in the same transactional store. For MySQL specifically, How to Add Semantic Search to a MySQL App walks through the full integration.

For very high scale (100M+ vectors, single-digit-ms latency), specialized vector databases (Pinecone, Qdrant) still win. Below that scale, in-database vectors are usually fine.
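
Part of why that threshold matters is plain arithmetic on the raw vector data. The sketch below assumes each dimension is stored as a 4-byte float32 and ignores index overhead, metadata, and replication, all of which add to the total; quantisation would reduce it.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value

for dims in (1024, 3072):
    gb = raw_vector_bytes(100_000_000, dims) / 1e9
    print(f"100M vectors x {dims} dims: {gb:.1f} GB")
# 100M vectors x 1024 dims: 409.6 GB
# 100M vectors x 3072 dims: 1228.8 GB
```

At 1 million vectors the same calculation gives about 4 GB for 1024 dimensions, which is comfortably within what a general-purpose database can hold.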

Hybrid search: BM25 + vector

Pure semantic (vector) search misses queries where the exact keyword matters: product names, error codes, specific function names. Pure BM25 (keyword) search misses queries phrased differently from the source.

Hybrid search runs both and merges results. The standard merge is RRF (Reciprocal Rank Fusion):

python

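def rrf(rankings_by_method, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings_by_method:
        # enumerate() is 0-based, so the top-ranked doc contributes 1 / (k + 1).
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)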

The vector search returns its top 20 by cosine similarity; the BM25 search returns its top 20 by keyword score; RRF merges them and you take the top 10 of the merged result.
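
To see the merge in action, here's a worked run on two toy rankings of three chunks each. The `rrf` function is repeated from above so the snippet runs on its own; the chunk IDs are invented.

```python
def rrf(rankings_by_method, k=60):
    scores = {}
    for ranking in rankings_by_method:
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["setup", "billing", "errors"]  # ordered by cosine similarity
bm25_hits = ["errors", "setup", "limits"]     # ordered by keyword score

merged = rrf([vector_hits, bm25_hits])
print(merged)  # ['setup', 'errors', 'billing', 'limits']
```

"setup" (ranks 1 and 2) and "errors" (ranks 3 and 1) appear in both lists, so they outrank "billing" and "limits", which each appear in only one. RRF uses only rank positions, so the cosine and BM25 scores never need to be put on a common scale.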

In most RAG implementations, hybrid search improves recall over vector-only retrieval, often by 10-20%; confirm the gain on your own eval set. It is worth the complexity for any production system.

Reranking: the optional 5th stage

After retrieving top K (say 20-50 chunks), pass them through a reranker that scores each chunk for relevance to the specific query. Take the top N (say 5-10) of the reranked results to pass to the LLM.

The leaders: Cohere Rerank 3.5 (rerank-v3.5), Voyage Rerank-2, BGE-Reranker (open-source). All are cross-encoders that take the query + chunk pair and output a score — slower than vector search per chunk but much more accurate.

The payoff: a small increase in latency (reranking 20 chunks takes maybe 200ms) in exchange for a meaningful boost in precision. If your evals show a chunk-precision problem, reranking is the first place to look.
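
The shape of the rerank stage is simple enough to sketch without committing to a vendor. In the code below, `score_fn` is the plug-in point where a real cross-encoder call would go; `term_overlap_score` is a deliberately crude stand-in I made up so the example runs offline, and it is not how any of the rerankers above score pairs.

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]

def term_overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)

candidates = [
    "Shipping takes three to five business days.",
    "Refunds are issued within 14 days of a return request.",
    "To request a refund, open the orders page and choose return.",
]
best = rerank("how do I request a refund", candidates, term_overlap_score, top_n=2)
print(best[0])  # the "To request a refund..." chunk
```

Swapping in a hosted or local reranker means replacing `term_overlap_score` with a function that sends the query and chunk to that model and returns its relevance score.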

RAG vs long context: when to skip RAG entirely

In 2026 you can fit a 1M-token corpus into a Claude or Gemini context window. For many knowledge bases — a single product manual, a contract, a codebase — that's enough that you don't need RAG at all.

The decision rule:

Corpus fits comfortably in context (under ~500K tokens) and is read often: long context is simpler, more accurate, and with prompt caching the cost is acceptable.
Corpus is too big for context, OR queries are infrequent, OR you have many tenants each with their own corpus: RAG is the right architecture.
Hybrid: use RAG to retrieve the right document, then load that one document fully into context.
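
As a rough sketch, the first two branches of that rule can be written as a function. The 500K-token budget is the article's approximate figure, the parameter names are mine, and the hybrid option is left out because it depends on how your documents are organised.

```python
def choose_architecture(corpus_tokens, many_tenants=False, queried_often=True,
                        context_budget=500_000):
    """Rule of thumb only: returns "long-context" or "rag".

    context_budget is the rough ~500K-token ceiling for a corpus that
    "fits comfortably" in a 1M-token window.
    """
    too_big = corpus_tokens > context_budget
    if too_big or many_tenants or not queried_often:
        return "rag"
    return "long-context"

print(choose_architecture(120_000))                       # long-context
print(choose_architecture(5_000_000))                     # rag
print(choose_architecture(120_000, many_tenants=True))    # rag
print(choose_architecture(120_000, queried_often=False))  # rag
```

Treat the output as a starting point for an eval, not a verdict: answer quality near the top of the context budget varies by model.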

Long context isn't a magic bullet — models still attend to early and late content more than the middle ("lost in the middle" problem), and quality on very long contexts varies by model. But for many real workloads, the simplicity of "stuff the whole manual into the prompt" is a win.

What to do next

For the application-side patterns:

How to Add Semantic Search to a MySQL App — practical walk-through of the MySQL 9 VECTOR type as the storage layer.
How to Stop an LLM from Hallucinating — RAG is the biggest anti-hallucination technique; this covers the grounding patterns that go with it.
How to Build an LLM Agent with Tool Use — agentic RAG (the model decides when to search) is the natural next step.

For cost optimisation of the retrieved-context part of the prompt:

How to Cut LLM API Costs with Prompt Caching — cache the system prompt and tool definitions; the per-query retrieved context is dynamic.

External references: Anthropic on RAG and context engineering, OpenAI embeddings guide, Voyage AI embedding models.

FAQ

What is RAG and when do I need it?

RAG (Retrieval-Augmented Generation) is the pattern of retrieving relevant documents from a knowledge base, including them in the LLM prompt, and generating an answer from that retrieved context. You need it when your knowledge base is too large to fit in the context window, or when the corpus updates more often than you can retrain.

In 2026, with 1M-token context windows, RAG is sometimes optional for smaller corpora. The decision rule in this article walks through when to use which.

What chunk size should I use for RAG?

500-1000 tokens per chunk, with 10-20% overlap between adjacent chunks. Smaller chunks (300-500) give better precision for fact-lookup queries; larger chunks (1000-2000) preserve more context for "explain X" queries.

Tune based on your evals. Run the same 50 questions through chunk sizes of 400, 800, and 1500, score the answers, pick the best. The right size is workload-specific.

Which embedding model is best for RAG in 2026?

For English: Voyage-3 leads independent benchmarks on technical content. OpenAI's text-embedding-3-large is a solid default. For multilingual: Cohere embed-multilingual or BGE-M3.

The gap between the top three is small. Run a quick benchmark on a sample of your own data before committing, because relative performance varies by domain.

Is Pinecone the best vector database for RAG?

Pinecone is the most polished hosted option. Qdrant is the leading open-source alternative. For most teams in 2026, the right answer is "use your existing database": pgvector on Postgres, the new VECTOR type on MySQL 9, MongoDB Atlas Vector Search. One fewer service to operate.

Switch to a dedicated vector DB when scale demands it: 100M+ vectors, sub-10ms query latency requirements, or specific features (metadata filtering at huge scale) that your in-database option doesn't handle.

Do I still need RAG with 1M-token context windows?

Sometimes no. If your corpus fits in context (under ~500K tokens to leave room for the response), loading the whole thing is simpler and often more accurate than retrieving slices. Prompt caching makes the cost acceptable.

You still need RAG when: the corpus is genuinely too big, queries are infrequent (caching doesn't help), or you have many tenants each with their own corpus. For a single product manual: skip RAG. For an entire company's wiki: keep RAG.

Ishan Karunaratne