TechEarl

How to Add Semantic Search to a MySQL App

Add semantic search to an existing MySQL app with MySQL 9's VECTOR type, an embedding model (Voyage, OpenAI), and application-side cosine ranking. No separate vector database needed.

Ishan Karunaratne⏱️ 12 min readUpdated
Share thisCopied
Add semantic search to an existing MySQL app with MySQL 9's VECTOR type and embeddings from Voyage or OpenAI. Store, query, and rank without a separate vector DB.

MySQL 9.0 added a native VECTOR type that lets you store embeddings alongside your existing data instead of adding Pinecone or pgvector to the stack. The integration is four steps: add a VECTOR column to the table you want to search, generate embeddings with a model like Voyage-3 or OpenAI's text-embedding-3-small, backfill the column for existing rows, then rank by cosine similarity at search time. I'll walk the full integration on a products table, the same table from the MySQL import CSV walkthrough, with code in Python and Node.js.

One thing to get straight up front, because it shapes every code sample below: the DISTANCE() similarity function and VECTOR INDEX (HNSW) DDL are MySQL HeatWave features on OCI, not part of MySQL Community or Commercial editions. Standard MySQL 9.x ships the VECTOR storage type and the STRING_TO_VECTOR() / VECTOR_TO_STRING() conversion functions, but it has no in-database cosine-distance function and no approximate-nearest-neighbour index. On a standard install you store the vectors in MySQL and compute the cosine ranking in your application code. That is exactly what this guide does. If you run HeatWave, I note the shortcuts it unlocks where they apply.

The reason this is appealing in 2026: the "use a dedicated vector database" architecture made sense in 2023 when Postgres and MySQL had no vector primitives. Now that they do, keeping embeddings and source data in the same transactional database removes a service from your stack, lets you join semantic results with regular SQL filters, and keeps everything in one backup/replication path. For workloads under ~10M vectors, in-database is the cleaner architecture.

Jump to:

Step 1: Add the VECTOR column

MySQL 9.0 introduced VECTOR(N) where N is the dimension count. The number must match your embedding model's output dimension — 1024 for Voyage-3, 1536 for OpenAI text-embedding-3-small, 3072 for text-embedding-3-large.

sql
ALTER TABLE products
ADD COLUMN description_embedding VECTOR(1024) NULL;

NULL for now — you'll backfill in step 3. After backfill, you can optionally tighten with ALTER TABLE products MODIFY description_embedding VECTOR(1024) NOT NULL; once the column is fully populated.

For the broader ADD COLUMN syntax including the modern ALGORITHM=INSTANT for fast schema changes, see How to Add a Column to a MySQL Table.

Step 2: Pick an embedding model

The major choices in 2026 (covered in detail in How to Build RAG with Embeddings and Vector Search):

  • Voyage-3 (1024 dim) — best benchmark scores on technical content, $0.06 per million tokens
  • OpenAI text-embedding-3-small (1536 dim) — solid default, $0.02 per million tokens
  • OpenAI text-embedding-3-large (3072 dim) — higher quality, $0.13 per million tokens
  • Local (Voyage-Code-3, BGE-large) — zero per-token cost, run on your own GPU

For e-commerce product search like the products example: Voyage-3 or text-embedding-3-small are both strong. Pick the cheaper one for the prototype, evaluate quality on real queries, upgrade if needed.

Match the dimension to your VECTOR(N) column. If you pick the model first, set N accordingly; if you've already created the column, pick a model with that dimension.

Step 3: Backfill existing rows

For an existing products table with thousands of rows, generate embeddings for each row's description and update the description_embedding column.

Python with OpenAI:

python
import os
from openai import OpenAI
import mysql.connector

openai = OpenAI()
db = mysql.connector.connect(
    host="localhost", user="root", password=os.environ["DB_PASS"],
    database="myshop", autocommit=False,
)
cursor = db.cursor()

# Stream products in batches
cursor.execute("SELECT sku, description FROM products WHERE description_embedding IS NULL")
batch_size = 100
batch = []
for sku, description in cursor.fetchall():
    batch.append((sku, description))
    if len(batch) >= batch_size:
        skus = [s for s, _ in batch]
        descriptions = [d for _, d in batch]
        # Batch embed (OpenAI supports up to 2048 inputs per call)
        response = openai.embeddings.create(model="text-embedding-3-small", input=descriptions)
        for (sku, _), emb in zip(batch, response.data):
            # MySQL expects vectors as a JSON-like array string
            vec_literal = "[" + ",".join(str(v) for v in emb.embedding) + "]"
            cursor.execute(
                "UPDATE products SET description_embedding = STRING_TO_VECTOR(%s) WHERE sku = %s",
                (vec_literal, sku),
            )
        db.commit()
        batch = []
# Process leftover
if batch:
    descriptions = [d for _, d in batch]
    response = openai.embeddings.create(model="text-embedding-3-small", input=descriptions)
    for (sku, _), emb in zip(batch, response.data):
        vec_literal = "[" + ",".join(str(v) for v in emb.embedding) + "]"
        cursor.execute(
            "UPDATE products SET description_embedding = STRING_TO_VECTOR(%s) WHERE sku = %s",
            (vec_literal, sku),
        )
    db.commit()

STRING_TO_VECTOR() (also spelled TO_VECTOR()) is MySQL 9's function for converting a JSON-array literal string to a VECTOR value. It is part of standard MySQL 9.x, Community edition included. Batch sizes of 100 keep the OpenAI call efficient (the API supports up to 2048 inputs per call but smaller batches recover faster from rate-limit errors).

For a 10K-row table this runs in 1-3 minutes including network. For 1M rows, plan an hour and run it from a server close to the embedding provider's region.

Step 4: Rank with cosine similarity

Standard MySQL stores the VECTOR value but has no function to compare two vectors, so the ranking happens in application code. Search-time: embed the user query, pull the candidate vectors out of MySQL with VECTOR_TO_STRING(), and score each one with cosine similarity in Python.

python
import json

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, limit: int = 10):
    # Embed the query
    response = openai.embeddings.create(model="text-embedding-3-small", input=[query])
    query_vec = response.data[0].embedding

    # Pull candidate rows; VECTOR_TO_STRING returns the embedding as a JSON array string
    cursor.execute(
        """
        SELECT sku, name, description,
               VECTOR_TO_STRING(description_embedding) AS embedding
        FROM products
        WHERE description_embedding IS NOT NULL
        """
    )
    scored = []
    for sku, name, description, embedding in cursor.fetchall():
        row_vec = json.loads(embedding)
        score = cosine_similarity(query_vec, row_vec)
        scored.append((sku, name, description, score))

    # Highest similarity first
    scored.sort(key=lambda r: r[3], reverse=True)
    return scored[:limit]

cosine_similarity() returns a value from -1.0 (opposite) to 1.0 (identical); higher is more similar. VECTOR_TO_STRING() (also spelled FROM_VECTOR()) serialises the stored VECTOR back to a JSON array string that json.loads() parses into a Python list. In production, use NumPy for the math: numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b)) is faster than the pure-Python loop above and the difference matters once you score thousands of rows per query.

For a Node.js equivalent, the mysql2/promise driver works the same way: run the SELECT ... VECTOR_TO_STRING(...) query, JSON.parse each embedding, and score with a cosine function.

If you run MySQL HeatWave on OCI, you can skip the application-side math. HeatWave adds the DISTANCE(v1, v2, 'COSINE') function, so the ranking moves into SQL:

sql
SELECT sku, name, description,
       DISTANCE(description_embedding, STRING_TO_VECTOR('[...]'), 'COSINE') AS distance
FROM products
WHERE description_embedding IS NOT NULL
ORDER BY distance ASC
LIMIT 10;

DISTANCE() returns the cosine distance (0.0 = identical, 2.0 = opposite), so HeatWave queries sort ASC. This function does not exist in Community or Commercial MySQL.

Combining with regular SQL filters

The killer feature of in-database vectors: SQL filters compose with similarity ranking. Need products in stock under $100 similar to "rugged hiking boots"? You push the structured filters into the SELECT so MySQL only returns matching candidates, then rank those candidates by cosine similarity in application code.

sql
SELECT sku, name, price_cents,
       VECTOR_TO_STRING(description_embedding) AS embedding
FROM products
WHERE description_embedding IS NOT NULL
  AND in_stock = 1
  AND price_cents < 10000;

The WHERE clause does the cheap, indexable work in the database. Your Python or Node code then scores the returned rows with the cosine function from step 4 and keeps the top results. With a dedicated vector database you would fetch candidates from the vector DB, then query MySQL for stock and price, then re-merge: more services, more code, more failure modes. Keeping the vectors in MySQL collapses the structured-filter half into a single statement.

On MySQL HeatWave the whole thing is one query, because DISTANCE() can sit in the same SELECT as the WHERE filters:

sql
SELECT sku, name, price_cents,
       DISTANCE(description_embedding, STRING_TO_VECTOR(:query_vec), 'COSINE') AS distance
FROM products
WHERE description_embedding IS NOT NULL
  AND in_stock = 1
  AND price_cents < 10000
ORDER BY distance ASC
LIMIT 10;

Keeping embeddings in sync on writes

When a product's description changes, the embedding becomes stale. Three patterns to keep them in sync:

1. Write-time sync (simplest, slowest writes): in the same transaction that updates description, call the embedding API and update description_embedding. Adds 100-300ms to each write — fine for low-volume admin edits, painful for high-frequency updates.

2. Async sync via queue (preferred for production): writes update description and emit a "re-embed this row" message to a job queue (SQS, RabbitMQ, BullMQ). A worker consumes the queue, embeds the new description, updates the row. Writes stay fast; embeddings catch up within seconds.

3. Periodic batch resync: a nightly job re-embeds every row whose description has been updated since the last run. Tracked via a description_updated_at timestamp column. Simplest if your data doesn't need real-time semantic-search accuracy.

For an e-commerce catalog with thousands of writes per day, async-queue sync is the right pattern. For a small CMS with a handful of edits per week, write-time sync is fine.

Performance: indexing and limits

This is the honest constraint of semantic search on standard MySQL: there is no approximate-nearest-neighbour (ANN) index. Every search reads the candidate rows the WHERE clause selects and scores each one. On Community or Commercial MySQL 9.x the vector comparison is always a linear scan over those candidates; the VECTOR type is storage only.

So the performance story is about keeping the candidate set small:

  • Lean hard on structured filters. Real indexes on in_stock, price_cents, category_id and the like cut the candidate set before any cosine math runs. A query that narrows to 2,000 candidate rows scores in single-digit milliseconds; one that scans 500,000 does not.
  • Score with NumPy, not a Python loop. Vectorised dot products and norms over a matrix of candidate embeddings are an order of magnitude faster than the per-row pure-Python version.
  • Cache query embeddings. Repeated searches for the same phrase should not re-hit the embedding API.

For a products table where filters narrow to a few thousand candidates, application-side ranking stays comfortably interactive. Once a single unfiltered search would have to score hundreds of thousands of rows, the linear scan stops being acceptable, and that is the point to move to a real ANN-indexed store.

If you run MySQL HeatWave on OCI, that ceiling lifts: HeatWave GenAI builds HNSW (Hierarchical Navigable Small World) vector indexes automatically for frequently queried VECTOR columns, taking similarity queries from an O(N) scan to an O(log N) approximate lookup. The VECTOR INDEX DDL and HNSW indexing are HeatWave features and do not exist in Community or Commercial MySQL.

Scaling limits common to all MySQL 9.x editions: a VECTOR value holds up to 16KB (about 4000 floats at 4 bytes each). For standard MySQL the practical sweet spot for in-database semantic search is small candidate sets behind selective filters; for large unfiltered corpora, HeatWave or a dedicated vector database pulls ahead on raw query latency.

What to do next

For the broader MySQL operational toolkit this semantic-search feature plugs into:

For the AI side that produces the embeddings:

External references: MySQL 9 VECTOR documentation, Voyage AI embeddings, OpenAI embeddings.

FAQ

See also

Sources

Authoritative references this article was fact-checked against.

TagsMySQLVector SearchSemantic SearchEmbeddingsAIVECTOR type

Found this useful? Pass it on.

Copied

Ishan Karunaratne

Software Systems Architect · Senior Software Engineer · Engineering Leadership

Software systems architect and senior software engineer with more than two decades designing, building, and running production software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Now a CTO, though what I write here is drawn from the full arc of that work, across architecture, engineering, and operations, not any single job.

Keep reading

Related posts

Macro photograph of a printer's typecase drawer with brass-and-wood compartments, one new compartment freshly added at the end, single warm side light

How to Add a Column to a MySQL Table

Add a column to a MySQL table with ALTER TABLE ADD COLUMN. Covers DEFAULT values, NOT NULL on existing rows, AFTER positioning, and ALGORITHM=INSTANT on MySQL 8.0.12+.