How to Add Semantic Search to a MySQL App

MySQL 9.0 added a native VECTOR type that lets you store embeddings alongside your existing data instead of adding Pinecone or pgvector to the stack. The integration is four steps: add a VECTOR column to the table you want to search, generate embeddings with a model like Voyage-3 or OpenAI's text-embedding-3-small, backfill the column for existing rows, then rank by cosine similarity at search time. I'll walk the full integration on a products table, the same table from the MySQL import CSV walkthrough, with code in Python and Node.js.

One thing to get straight up front, because it shapes every code sample below: the DISTANCE() similarity function and VECTOR INDEX (HNSW) DDL are MySQL HeatWave features on OCI, not part of MySQL Community or Commercial editions. Standard MySQL 9.x ships the VECTOR storage type and the STRING_TO_VECTOR() / VECTOR_TO_STRING() conversion functions, but it has no in-database cosine-distance function and no approximate-nearest-neighbour index. On a standard install you store the vectors in MySQL and compute the cosine ranking in your application code. That is exactly what this guide does. If you run HeatWave, I note the shortcuts it unlocks where they apply.

The reason this is appealing in 2026: the "use a dedicated vector database" architecture made sense in 2023 when Postgres and MySQL had no vector primitives. Now that they do, keeping embeddings and source data in the same transactional database removes a service from your stack, lets you join semantic results with regular SQL filters, and keeps everything in one backup/replication path. For workloads under ~10M vectors, in-database is the cleaner architecture.

Jump to:

Step 1: Add the VECTOR column
Step 2: Pick an embedding model
Step 3: Backfill existing rows
Step 4: Rank with cosine similarity
Combining with regular SQL filters
Keeping embeddings in sync on writes
Performance: indexing and limits
FAQ

Step 1: Add the VECTOR column

MySQL 9.0 introduced VECTOR(N) where N is the dimension count. The number must match your embedding model's output dimension — 1024 for Voyage-3, 1536 for OpenAI text-embedding-3-small, 3072 for text-embedding-3-large.

sql

ALTER TABLE products
ADD COLUMN description_embedding VECTOR(1024) NULL;

NULL for now — you'll backfill in step 3. After backfill, you can optionally tighten with ALTER TABLE products MODIFY description_embedding VECTOR(1024) NOT NULL; once the column is fully populated.

For the broader ADD COLUMN syntax including the modern ALGORITHM=INSTANT for fast schema changes, see How to Add a Column to a MySQL Table.

Step 2: Pick an embedding model

The major choices in 2026 (covered in detail in How to Build RAG with Embeddings and Vector Search):

Voyage-3 (1024 dim) — best benchmark scores on technical content, $0.06 per million tokens
OpenAI text-embedding-3-small (1536 dim) — solid default, $0.02 per million tokens
OpenAI text-embedding-3-large (3072 dim) — higher quality, $0.13 per million tokens
Local (Voyage-Code-3, BGE-large) — zero per-token cost, run on your own GPU

For e-commerce product search like the products example: Voyage-3 or text-embedding-3-small are both strong. Pick the cheaper one for the prototype, evaluate quality on real queries, upgrade if needed.

Match the dimension to your VECTOR(N) column. If you pick the model first, set N accordingly; if you've already created the column, pick a model with that dimension.

Step 3: Backfill existing rows

For an existing products table with thousands of rows, generate embeddings for each row's description and update the description_embedding column.

Python with OpenAI:

python

import os
from openai import OpenAI
import mysql.connector

openai = OpenAI()
db = mysql.connector.connect(
    host="localhost", user="root", password=os.environ["DB_PASS"],
    database="myshop", autocommit=False,
)
cursor = db.cursor()

# Stream products in batches
cursor.execute("SELECT sku, description FROM products WHERE description_embedding IS NULL")
batch_size = 100
batch = []
for sku, description in cursor.fetchall():
    batch.append((sku, description))
    if len(batch) >= batch_size:
        skus = [s for s, _ in batch]
        descriptions = [d for _, d in batch]
        # Batch embed (OpenAI supports up to 2048 inputs per call)
        response = openai.embeddings.create(model="text-embedding-3-small", input=descriptions)
        for (sku, _), emb in zip(batch, response.data):
            # MySQL expects vectors as a JSON-like array string
            vec_literal = "[" + ",".join(str(v) for v in emb.embedding) + "]"
            cursor.execute(
                "UPDATE products SET description_embedding = STRING_TO_VECTOR(%s) WHERE sku = %s",
                (vec_literal, sku),
            )
        db.commit()
        batch = []
# Process leftover
if batch:
    descriptions = [d for _, d in batch]
    response = openai.embeddings.create(model="text-embedding-3-small", input=descriptions)
    for (sku, _), emb in zip(batch, response.data):
        vec_literal = "[" + ",".join(str(v) for v in emb.embedding) + "]"
        cursor.execute(
            "UPDATE products SET description_embedding = STRING_TO_VECTOR(%s) WHERE sku = %s",
            (vec_literal, sku),
        )
    db.commit()

STRING_TO_VECTOR() (also spelled TO_VECTOR()) is MySQL 9's function for converting a JSON-array literal string to a VECTOR value. It is part of standard MySQL 9.x, Community edition included. Batch sizes of 100 keep the OpenAI call efficient (the API supports up to 2048 inputs per call but smaller batches recover faster from rate-limit errors).

For a 10K-row table this runs in 1-3 minutes including network. For 1M rows, plan an hour and run it from a server close to the embedding provider's region.

Step 4: Rank with cosine similarity

Standard MySQL stores the VECTOR value but has no function to compare two vectors, so the ranking happens in application code. Search-time: embed the user query, pull the candidate vectors out of MySQL with VECTOR_TO_STRING(), and score each one with cosine similarity in Python.

python

import json

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, limit: int = 10):
    # Embed the query
    response = openai.embeddings.create(model="text-embedding-3-small", input=[query])
    query_vec = response.data[0].embedding

    # Pull candidate rows; VECTOR_TO_STRING returns the embedding as a JSON array string
    cursor.execute(
        """
        SELECT sku, name, description,
               VECTOR_TO_STRING(description_embedding) AS embedding
        FROM products
        WHERE description_embedding IS NOT NULL
        """
    )
    scored = []
    for sku, name, description, embedding in cursor.fetchall():
        row_vec = json.loads(embedding)
        score = cosine_similarity(query_vec, row_vec)
        scored.append((sku, name, description, score))

    # Highest similarity first
    scored.sort(key=lambda r: r[3], reverse=True)
    return scored[:limit]

cosine_similarity() returns a value from -1.0 (opposite) to 1.0 (identical); higher is more similar. VECTOR_TO_STRING() (also spelled FROM_VECTOR()) serialises the stored VECTOR back to a JSON array string that json.loads() parses into a Python list. In production, use NumPy for the math: numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b)) is faster than the pure-Python loop above and the difference matters once you score thousands of rows per query.

For a Node.js equivalent, the mysql2/promise driver works the same way: run the SELECT ... VECTOR_TO_STRING(...) query, JSON.parse each embedding, and score with a cosine function.

If you run MySQL HeatWave on OCI, you can skip the application-side math. HeatWave adds the DISTANCE(v1, v2, 'COSINE') function, so the ranking moves into SQL:

sql

SELECT sku, name, description,
       DISTANCE(description_embedding, STRING_TO_VECTOR('[...]'), 'COSINE') AS distance
FROM products
WHERE description_embedding IS NOT NULL
ORDER BY distance ASC
LIMIT 10;

DISTANCE() returns the cosine distance (0.0 = identical, 2.0 = opposite), so HeatWave queries sort ASC. This function does not exist in Community or Commercial MySQL.

Combining with regular SQL filters

The killer feature of in-database vectors: SQL filters compose with similarity ranking. Need products in stock under $100 similar to "rugged hiking boots"? You push the structured filters into the SELECT so MySQL only returns matching candidates, then rank those candidates by cosine similarity in application code.

sql

SELECT sku, name, price_cents,
       VECTOR_TO_STRING(description_embedding) AS embedding
FROM products
WHERE description_embedding IS NOT NULL
  AND in_stock = 1
  AND price_cents < 10000;

The WHERE clause does the cheap, indexable work in the database. Your Python or Node code then scores the returned rows with the cosine function from step 4 and keeps the top results. With a dedicated vector database you would fetch candidates from the vector DB, then query MySQL for stock and price, then re-merge: more services, more code, more failure modes. Keeping the vectors in MySQL collapses the structured-filter half into a single statement.

On MySQL HeatWave the whole thing is one query, because DISTANCE() can sit in the same SELECT as the WHERE filters:

sql

SELECT sku, name, price_cents,
       DISTANCE(description_embedding, STRING_TO_VECTOR(:query_vec), 'COSINE') AS distance
FROM products
WHERE description_embedding IS NOT NULL
  AND in_stock = 1
  AND price_cents < 10000
ORDER BY distance ASC
LIMIT 10;

Keeping embeddings in sync on writes

When a product's description changes, the embedding becomes stale. Three patterns to keep them in sync:

1. Write-time sync (simplest, slowest writes): in the same transaction that updates description, call the embedding API and update description_embedding. Adds 100-300ms to each write — fine for low-volume admin edits, painful for high-frequency updates.

2. Async sync via queue (preferred for production): writes update description and emit a "re-embed this row" message to a job queue (SQS, RabbitMQ, BullMQ). A worker consumes the queue, embeds the new description, updates the row. Writes stay fast; embeddings catch up within seconds.

3. Periodic batch resync: a nightly job re-embeds every row whose description has been updated since the last run. Tracked via a description_updated_at timestamp column. Simplest if your data doesn't need real-time semantic-search accuracy.

For an e-commerce catalog with thousands of writes per day, async-queue sync is the right pattern. For a small CMS with a handful of edits per week, write-time sync is fine.

Performance: indexing and limits

This is the honest constraint of semantic search on standard MySQL: there is no approximate-nearest-neighbour (ANN) index. Every search reads the candidate rows the WHERE clause selects and scores each one. On Community or Commercial MySQL 9.x the vector comparison is always a linear scan over those candidates; the VECTOR type is storage only.

So the performance story is about keeping the candidate set small:

Lean hard on structured filters. Real indexes on in_stock, price_cents, category_id and the like cut the candidate set before any cosine math runs. A query that narrows to 2,000 candidate rows scores in single-digit milliseconds; one that scans 500,000 does not.
Score with NumPy, not a Python loop. Vectorised dot products and norms over a matrix of candidate embeddings are an order of magnitude faster than the per-row pure-Python version.
Cache query embeddings. Repeated searches for the same phrase should not re-hit the embedding API.

For a products table where filters narrow to a few thousand candidates, application-side ranking stays comfortably interactive. Once a single unfiltered search would have to score hundreds of thousands of rows, the linear scan stops being acceptable, and that is the point to move to a real ANN-indexed store.

If you run MySQL HeatWave on OCI, that ceiling lifts: HeatWave GenAI builds HNSW (Hierarchical Navigable Small World) vector indexes automatically for frequently queried VECTOR columns, taking similarity queries from an O(N) scan to an O(log N) approximate lookup. The VECTOR INDEX DDL and HNSW indexing are HeatWave features and do not exist in Community or Commercial MySQL.

Scaling limits common to all MySQL 9.x editions: a VECTOR value holds up to 16KB (about 4000 floats at 4 bytes each). For standard MySQL the practical sweet spot for in-database semantic search is small candidate sets behind selective filters; for large unfiltered corpora, HeatWave or a dedicated vector database pulls ahead on raw query latency.

What to do next

For the broader MySQL operational toolkit this semantic-search feature plugs into:

How to Add a Column to a MySQL Table — the ADD COLUMN patterns including ALGORITHM=INSTANT for fast schema changes.
How to Find Duplicate Rows in MySQL — useful for catching duplicate product descriptions before they pollute your embedding space.
MySQL Cheat Sheet — the wider syntax reference for queries, indexing, and admin commands.

For the AI side that produces the embeddings:

How to Build RAG with Embeddings and Vector Search — the broader RAG architecture this MySQL implementation slots into.
How to Cut LLM API Costs with Prompt Caching — for the LLM step downstream of semantic retrieval.

External references: MySQL 9 VECTOR documentation, Voyage AI embeddings, OpenAI embeddings.

FAQ

Partly. Since MySQL 9.0 (released July 2024) the core engine ships the VECTOR(N) storage type and the STRING_TO_VECTOR() / VECTOR_TO_STRING() conversion functions (aliases TO_VECTOR() / FROM_VECTOR()), so Community and Commercial MySQL can store embeddings.

What standard MySQL does not have is an in-database similarity function or an ANN index. The cosine DISTANCE() function and the VECTOR INDEX (HNSW) syntax are MySQL HeatWave features on OCI only. On a standard install you store the vectors in MySQL and compute the cosine ranking in your application code.

For workloads under ~10M vectors, MySQL VECTOR is the right answer if your application data is already in MySQL — one fewer service, transactional consistency, SQL filters compose naturally with similarity ranking.

For 10M+ vectors or sub-10ms latency requirements, dedicated vector databases (Pinecone, Qdrant) pull ahead on raw performance. Most production AI apps don't reach that scale.

Match it to the embedding model's output dimension. Voyage-3: 1024. OpenAI text-embedding-3-small: 1536. OpenAI text-embedding-3-large: 3072. Cohere embed-english-v3: 1024.

Don't pick the dimension first and then constrain your model choice. Pick the model based on quality and cost, then set the column dimension to match.

Depends on the model and the description length. For 100-word descriptions averaging 130 tokens: 1M descriptions × 130 tokens = 130M tokens. At OpenAI text-embedding-3-small ($0.02/M tokens) that's $2.60. At Voyage-3 ($0.06/M) that's $7.80.

One-time cost for the backfill. Ongoing costs are just new-row embeddings (cheap) and re-embedding when descriptions change.

Yes. The embedding is a snapshot of the description text at embed-time. When the description changes, the embedding becomes stale and the row will rank poorly for queries that should have matched the new content.

Three sync patterns: write-time (simplest, slow writes), async-queue (best for production), periodic batch resync (simplest for low-volume edits). Pick based on your write frequency.

How to Add Semantic Search to a MySQL App

Step 1: Add the VECTOR column

Step 2: Pick an embedding model

Step 3: Backfill existing rows

Step 4: Rank with cosine similarity

Combining with regular SQL filters

Keeping embeddings in sync on writes

Performance: indexing and limits

What to do next

FAQ

See also

Sources

Ishan Karunaratne

Related posts

How to Add a Column to a MySQL Table

How to Add an SSH Key to GitHub and GitLab

How to Add an SSH Key to a Server (ssh-copy-id)

Does MySQL really support vector search natively?

Should I use MySQL VECTOR or a dedicated vector database like Pinecone?

What vector dimension should I use?

How much does it cost to embed a million product descriptions?

Do I need to re-embed when a row's description changes?

Sources

Ishan Karunaratne