TechEarl

How to Build RAG with Embeddings and Vector Search

Build RAG with embeddings, chunking, and vector search. Covers OpenAI / Voyage / Cohere embeddings, chunk sizes, hybrid search, reranking, and when long context replaces RAG entirely.

Ishan Karunaratne · 10 min read
RAG (Retrieval-Augmented Generation) in 2026 is a four-stage pipeline. Chunk your documents into 500-1500 token pieces. Embed each chunk with a model like Voyage-3 or OpenAI's text-embedding-3-large. Store the embeddings in a vector database (Pinecone, Qdrant, MySQL 9's VECTOR type, Postgres pgvector). Then, at query time, embed the user question, run a cosine-similarity search, optionally rerank, and pass the top-K chunks to the LLM. The pipeline is well understood now, but 2026's actual question is "should I use RAG or just stuff the relevant context into a 1M-token Claude prompt?" I'll cover the pipeline, the chunking and embedding choices, and the long-context-vs-RAG decision.

The reason this matters: any time your LLM needs to answer questions from a knowledge base that's bigger than the context window, RAG is the canonical answer. With 1M-token context now standard, that threshold has moved. For some knowledge bases (a single product manual, a contract, a codebase), long context replaces RAG. For others (a CRM with 10 million records, a corpus of legal documents, an entire website), RAG is still the only viable architecture.

The four stages of RAG

Stage 1: Chunk. Split each document into pieces small enough to embed (300-2000 tokens each, depending on the embedding model's context window).

Stage 2: Embed. Generate a vector for each chunk using an embedding model.

Stage 3: Index. Store the vectors in a database that supports approximate-nearest-neighbour (ANN) search.

Stage 4: Retrieve. Embed the user query, run ANN search to find the top K most similar chunks, pass them to the LLM with the user question.

Optional Stage 5 (covered below): rerank the top K to a more focused top N before passing to the LLM.
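
To make the stages concrete, here is a minimal end-to-end sketch in plain Python. It rests on loud assumptions: `toy_embed` is a hashed bag-of-words stand-in, not a real embedding model, the "index" is a plain list scanned exactly rather than an ANN index, and every function name is hypothetical.

```python
import hashlib
import math

def toy_embed(text, dims=256):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dims
    for token in text.lower().split():
        # Stable hash so results are reproducible across runs
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine
    return sum(x * y for x, y in zip(a, b))

def build_index(chunks):
    """Stages 2-3: embed every chunk and keep (vector, chunk) pairs."""
    return [(toy_embed(c), c) for c in chunks]

def retrieve(index, query, k=2):
    """Stage 4: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(chunks, question):
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"

# Stage 1 is assumed done: these strings play the role of chunks
chunks = [
    "Refunds are processed within 5 business days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday to Friday.",
]
index = build_index(chunks)
top = retrieve(index, "what is the api rate limit", k=1)
prompt = build_prompt(top, "What is the API rate limit?")
```

In a real system you would swap `toy_embed` for a call to your embedding provider and the list scan for a vector database query; the shape of the pipeline stays the same.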

Chunking: size, overlap, and boundaries

Chunk size trade-off:

  • Smaller chunks (300-500 tokens): better precision (the relevant chunk is more focused) but more chunks total, more API calls to embed, more vectors to index. Better for fact-lookup queries.
  • Larger chunks (1000-2000 tokens): better context preservation (more surrounding info per chunk) but coarser matching. Better for "explain X" queries that need broader context.

The middle of the range (500-1000 tokens) works for most use cases. Tune based on your eval results.

Overlap: 10-20% between adjacent chunks. The overlap ensures that a fact split across a chunk boundary still appears intact in at least one of the chunks. Without overlap, you sometimes get the "sentence ends with 'the patient' and the next chunk starts with 'is allergic to penicillin'" failure mode.
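
A sliding window is one simple way to get that overlap. The sketch below assumes the text is already tokenised into a list; the function name, the whitespace "tokens", and the tiny chunk size are for illustration only.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=0.15):
    """Split a token list into fixed-size windows that share `overlap` of their length."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # How far each window advances; never less than one token
    step = max(1, round(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace "tokens" for illustration; a real pipeline would use the model's tokenizer
words = "the patient is allergic to penicillin and should not receive it".split()
pieces = chunk_with_overlap(words, chunk_size=4, overlap=0.5)
```

With these toy settings the second window is `is allergic to penicillin`, so the fact survives intact even though the first window cuts off after "allergic".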

Boundaries: split on semantic boundaries (paragraphs, sections, sentences) rather than mid-word or mid-sentence. Most chunking libraries (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's chunkers) handle this. If you're rolling your own, split on \n\n (blank lines) first, fall back to sentence-ending periods, then fall back to a hard character limit.
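
If you do roll your own, the fallback order can be sketched roughly as follows. This is a simplified illustration, not how any particular library implements it: `split_text` and `max_chars` are hypothetical names, sizes are measured in characters rather than tokens, and the sentence split also treats `!` and `?` as sentence ends.

```python
import re

def split_text(text, max_chars=200):
    """Split on blank lines first, then sentence ends, then a hard character limit."""
    pieces = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            pieces.append(para)
            continue
        # Paragraph too long: fall back to sentence boundaries
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if len(sentence) <= max_chars:
                pieces.append(sentence)
            else:
                # Last resort: hard cut every max_chars characters
                pieces.extend(
                    sentence[i:i + max_chars]
                    for i in range(0, len(sentence), max_chars)
                )
    return pieces

sample = "Short para.\n\n" + "A" * 30 + ". " + "B" * 10 + "."
parts = split_text(sample, max_chars=20)
```

Note that this sketch never merges small neighbouring pieces back together, which a production splitter would normally do to keep chunks near the target size.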

For code, split on function or class boundaries. For Markdown, split on headings. For HTML, split on <section> or <article> boundaries (similar to the HTML tag matching patterns).

Embedding models in 2026

The major players:

  • OpenAI text-embedding-3-large: 3072 dimensions, $0.13 per million tokens. Strong general-purpose.
  • Voyage-3: 1024 dimensions, $0.06 per million tokens. Best quality on independent benchmarks for English RAG; particularly strong on code and technical content.
  • Cohere embed-english-v3: 1024 dimensions, $0.10 per million tokens. Good with their reranker for the integrated pipeline.
  • Gemini text-embedding-004: 768 dimensions, free up to limits, $0.025 per million paid.
  • Local options: BGE-large, E5-large, sentence-transformers/all-mpnet-base-v2 — run on your own GPU, zero per-token cost, lower quality than the hosted leaders.
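
To see what those dimensions and prices mean in practice, here is a back-of-the-envelope sketch. The figures come from the list above; the corpus size (1M chunks of about 750 tokens) and the float32 storage assumption are hypothetical, and the size estimate ignores index overhead and metadata.

```python
# (dimensions, USD per million tokens), taken from the list above
MODELS = {
    "text-embedding-3-large": (3072, 0.13),
    "voyage-3": (1024, 0.06),
    "embed-english-v3": (1024, 0.10),
}

def embedding_cost_usd(total_tokens, price_per_million):
    """One-off cost of embedding the whole corpus."""
    return total_tokens / 1_000_000 * price_per_million

def index_size_gb(num_chunks, dims, bytes_per_value=4):
    """Raw vector storage only (float32 assumed); ignores index overhead."""
    return num_chunks * dims * bytes_per_value / 1e9

# Hypothetical corpus: 1M chunks of ~750 tokens each
num_chunks, tokens_per_chunk = 1_000_000, 750
for name, (dims, price) in MODELS.items():
    cost = embedding_cost_usd(num_chunks * tokens_per_chunk, price)
    size = index_size_gb(num_chunks, dims)
    print(f"{name}: ${cost:,.2f} to embed, {size:.1f} GB of raw vectors")
```

For this hypothetical corpus the embedding bill is under $100 for any of the three, while the 3072-dimension model needs about three times the vector storage of the 1024-dimension ones. At this scale storage and search cost tend to matter more than the per-token price.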

Pick on benchmark performance for your domain (try 2-3 on a representative sample), not on marketing. The gap between the top three is small enough that any of them works; the gap to local models is larger but their zero per-token cost matters at scale.
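
A comparison like that does not need much tooling. The sketch below computes mean recall@k over a small labelled sample; the query set, chunk IDs, and the two `search_*` functions are made-up stand-ins for "retrieve with model A" and "retrieve with model B".

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall(search_fn, eval_set, k=5):
    """Average recall@k of one retrieval function over labelled queries."""
    scores = [recall_at_k(search_fn(q), relevant, k) for q, relevant in eval_set]
    return sum(scores) / len(scores)

# Hypothetical labelled sample: (query, IDs of the chunks that answer it)
eval_set = [
    ("reset password", {"c1", "c7"}),
    ("refund window", {"c3"}),
]

# Stand-ins for real searches backed by two different embedding models
def search_a(query):
    return {"reset password": ["c1", "c2", "c7"], "refund window": ["c9", "c3"]}[query]

def search_b(query):
    return {"reset password": ["c2", "c1", "c4"], "refund window": ["c8", "c9"]}[query]

results = {
    name: mean_recall(fn, eval_set, k=3)
    for name, fn in [("model_a", search_a), ("model_b", search_b)]
}
best = max(results, key=results.get)
```

A few dozen real queries with hand-labelled relevant chunks is usually enough to see whether one model is clearly ahead on your content.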

Vector databases compared

| Database | Hosted | Self-host | Best for |
|---|---|---|---|
| Pinecone | Yes (managed) | No | Production scale, simplest setup |
| Qdrant | Both | Yes | Open-source preference, self-host control |
| Weaviate | Both | Yes | Hybrid search out of the box |
| MySQL 9.0+ VECTOR type | Self-host or managed MySQL | Yes | Already on MySQL, want one fewer service |
| Postgres pgvector | Yes (Neon, Supabase) | Yes | Already on Postgres, want one fewer service |
| Chroma | Local mostly | Yes | Prototyping, embedded use |
| MongoDB Atlas Vector Search | Yes | No | Already on MongoDB |

The 2024-2026 trend has been "do RAG in your existing database": pgvector for Postgres shops, the new VECTOR type for MySQL 9 shops. This avoids a separate service and keeps your data and embeddings in the same transactional store. For MySQL specifically, "How to Add Semantic Search to a MySQL App" walks through the full integration.

For very high scale (100M+ vectors, single-digit-ms latency), specialized vector databases (Pinecone, Qdrant) still win. Below that scale, in-database vectors are usually fine.

Hybrid search: BM25 + vector

Pure semantic (vector) search misses queries where the exact keyword matters: product names, error codes, specific function names. Pure BM25 (keyword) search misses queries phrased differently from the source.
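
To show what the keyword side does, here is a compact sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenisation and the commonly used defaults `k1=1.5` and `b=0.75`; in production you would normally lean on your search engine's built-in BM25 rather than this hypothetical `bm25_rank` helper.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return doc indices sorted by Okapi BM25 score (best first)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many docs each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long docs need more hits for the same score
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "Restart the service if the worker stops responding",
    "Error E4012 means the license key has expired",
    "Licensing problems are usually caused by clock drift",
]
ranking = bm25_rank("E4012", docs)
```

The exact error code only appears in the second document, so it ranks first regardless of how an embedding model would have placed the semantically similar "licensing problems" text.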

Hybrid search runs both and merges results. The standard merge is RRF (Reciprocal Rank Fusion):

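```python
def rrf(rankings_by_method, k=60):
    """Merge several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings_by_method:
        # enumerate() is 0-based, so the 1-based rank is i + 1
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```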

The vector search returns its top 20 by cosine similarity; the BM25 search returns its top 20 by keyword score; RRF merges them and you take the top 10 of the merged result.
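
A small worked example makes the merge easier to follow. The block below repeats the same fusion rule but returns the scores so they can be inspected; the document IDs and both result lists are made up, and the lists are shortened to three entries each.

```python
def rrf_scores(rankings_by_method, k=60):
    """Same fusion rule as rrf, but returns the per-document scores."""
    scores = {}
    for ranking in rankings_by_method:
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return scores

# Hypothetical top results from each retriever
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

scores = rrf_scores([vector_hits, bm25_hits])
merged = sorted(scores, key=scores.get, reverse=True)
top_2 = merged[:2]
```

Here `doc_b` scores 1/62 + 1/61 (second in one list, first in the other) and edges out `doc_a` at 1/61 + 1/63 (first and third). Documents that appear in only one list, `doc_c` and `doc_d`, fall to the bottom. Only ranks matter, so the cosine and BM25 scores never need to be put on a common scale.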

For most RAG implementations, hybrid search delivers 10-20% higher recall than vector-only. Worth the complexity for any production system.

Reranking: the optional 5th stage

After retrieving top K (say 20-50 chunks), pass them through a reranker that scores each chunk for relevance to the specific query. Take the top N (say 5-10) of the reranked results to pass to the LLM.

The leaders: Cohere Rerank-3, Voyage Rerank-2, BGE-Reranker (open-source). All are cross-encoders that take the query + chunk pair and output a score — slower than vector search per chunk but much more accurate.

The payoff: a small increase in latency (reranking 20 chunks adds maybe 200ms) for a meaningful boost in precision. If your evals show a chunk-precision problem, reranking is the first place to look.
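
The control flow of a rerank stage is simple, whichever model does the scoring. In the sketch below, `overlap_score` is a toy word-overlap function standing in for a real cross-encoder call; it is an assumption for illustration and is far weaker than any of the rerankers named above.

```python
def rerank(query, chunks, score_fn, top_n=5):
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def overlap_score(query, chunk):
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Pretend these came back from the retrieval stage
candidates = [
    "Billing runs on the first day of each month",
    "To rotate an API key open the security settings page",
    "API keys are scoped to a single workspace",
]
best = rerank("rotate api key", candidates, overlap_score, top_n=1)
```

Because `score_fn` sees the query and the chunk together, swapping in a hosted reranker or a local cross-encoder only changes that one function.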

RAG vs long context: when to skip RAG entirely

In 2026 you can fit a 1M-token corpus into a Claude or Gemini context window. For many knowledge bases — a single product manual, a contract, a codebase — that's enough that you don't need RAG at all.

The decision rule:

  • Corpus fits comfortably in context (under ~500K tokens) and is read often: long context is simpler, more accurate, and with prompt caching the cost is acceptable.
  • Corpus is too big for context, OR queries are infrequent, OR you have many tenants each with their own corpus: RAG is the right architecture.
  • Hybrid: use RAG to retrieve the right document, then load that one document fully into context.
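
The rule above can be written down as a small function. Treat this as one possible reading, not a definitive policy: the 500K budget comes from the rule, but the `frequent_queries` cut-off and the condition for choosing the hybrid option (every single document fits in context) are assumptions made here for illustration.

```python
def choose_architecture(corpus_tokens, queries_per_day, tenants=1,
                        largest_doc_tokens=None,
                        context_budget=500_000, frequent_queries=100):
    """Return 'long context', 'hybrid', or 'rag' for a given workload."""
    fits = corpus_tokens <= context_budget
    # Small, frequently read, single-tenant corpus: just load it all
    if fits and queries_per_day >= frequent_queries and tenants == 1:
        return "long context"
    # Retrieval is needed; if each document fits, retrieve one and load it whole
    if largest_doc_tokens is not None and largest_doc_tokens <= context_budget:
        return "hybrid"
    return "rag"

manual = choose_architecture(200_000, 1_000)
crm = choose_architecture(50_000_000, 1_000)
contracts = choose_architecture(50_000_000, 1_000, largest_doc_tokens=80_000)
rarely_used = choose_architecture(200_000, 5)
```

The thresholds are the part to tune: the right values depend on your model's pricing, how well prompt caching applies to your traffic, and how the model's quality holds up at long context lengths.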

Long context isn't a magic bullet — models still attend to early and late content more than the middle ("lost in the middle" problem), and quality on very long contexts varies by model. But for many real workloads, the simplicity of "stuff the whole manual into the prompt" is a win.

What to do next

External references: Anthropic on RAG and context engineering, OpenAI embeddings guide, Voyage AI embedding models.

Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.
