How to Write LLM Evals That Catch Regressions (2026)

LLM evals are unit tests for prompts. They detect when a prompt edit, a model switch, or a context-retrieval change quietly breaks a 5% slice of your traffic: the kind of regression a human reviewer would miss but a 100-case test suite catches. In 2026 the standard stack is a golden dataset of input-output pairs that represent your real traffic, a mix of scoring methods (exact match for structured outputs, LLM-as-judge for free-form, embedding similarity for semantic equivalence), and a CI integration that runs the eval on every prompt change and fails the build if quality drops below a threshold. I'll walk through the full stack with code examples.

The reason this matters: LLM behaviour is fragile in ways traditional code isn't. Change one word in a system prompt and 10% of outputs suddenly look different. Switch models and the prompt that worked yesterday produces malformed JSON today. Without evals, you ship those regressions and find out from user complaints. With evals, you catch them on the PR.

Jump to:

Building a golden dataset
Scoring methods: which one when
LLM-as-judge: the workhorse for free-form outputs
Embedding similarity for semantic match
Code: a runnable eval harness
Running evals in CI
Tracking quality over time
FAQ

Building a golden dataset

The dataset is the heart of the eval. Without good cases, the scoring doesn't matter.

Start with 50-100 cases that represent your real traffic. Mix:

Happy path: the common 60% of queries
Edge cases: 20% — long inputs, short inputs, weird characters, multilingual
Adversarial: 10% — prompt injection attempts, off-topic requests, requests the model should refuse
Regression cases: 10% — every bug you've fixed, captured as a case. When a user reports the model gave a wrong answer, add the input to the dataset before fixing.

Each case is { input: ..., expected: ..., metadata: { tags, source, etc. } }. The expected field's format depends on the scoring method (a string for exact match, a rubric for LLM-as-judge, a sample acceptable response for similarity).

Store the dataset as a checked-in JSON file in your repo. Version it. When you add cases, commit them. Treat it as documentation of "what good looks like" for this prompt.
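Because the dataset is checked in and grows over time, it can help to validate it before every run. The following is a minimal sketch, not a prescribed format: it assumes each case's `input` is a string and that the category (happy path, edge, adversarial, regression) is stored in a hypothetical `metadata.category` field, which the case shape above does not name.

```python
import json
from collections import Counter

REQUIRED_KEYS = {"input", "expected", "metadata"}


def load_dataset(path):
    """Read the checked-in dataset file (a JSON array of cases)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def validate_dataset(cases):
    """Return a list of problems; an empty list means the dataset is usable."""
    problems = []
    seen_inputs = set()
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {i}: missing {sorted(missing)}")
            continue
        # Duplicate inputs silently double-weight one behaviour in the average.
        if case["input"] in seen_inputs:
            problems.append(f"case {i}: duplicate input")
        seen_inputs.add(case["input"])
    return problems


def category_mix(cases):
    """Fraction of cases per category (metadata.category is an assumed field)."""
    counts = Counter(
        c.get("metadata", {}).get("category", "untagged") for c in cases
    )
    return {name: n / len(cases) for name, n in counts.items()}
```

Calling `validate_dataset(load_dataset("eval-dataset.json"))` before the eval starts turns a malformed case into an immediate, readable failure, and `category_mix` shows how far the dataset has drifted from the 60/20/10/10 split.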

Scoring methods: which one when

Three primary methods, each suited to a different output shape:

Method	Best for	Speed	Reliability
Exact match	Structured outputs (JSON, classification labels, enums)	Instant	Perfect when applicable
LLM-as-judge	Free-form text, multi-criteria quality, "is this answer correct"	Slow (LLM call per case)	Good with the right judge prompt
Embedding similarity	"Is this semantically close to the expected answer?"	Fast	Moderate — fails on negations

Exact match is your first choice when the output is structured. If the prompt produces JSON, parse it and compare the parsed object to the expected one, so that key order and whitespace don't matter. If it classifies into 5 buckets, compare the chosen bucket. If it extracts a phone number, normalise and compare.
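The three exact-match comparisons can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical function names. It compares parsed objects directly, which is one simple way to ignore key order, and it normalises phone numbers by keeping digits only, which would not reconcile a number written with and without a country code.

```python
import json
import re


def json_match(model_output: str, expected: dict) -> bool:
    """True if the output parses as JSON and equals the expected object."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a failed case, not a crash
    return parsed == expected  # dict equality ignores key order and whitespace


def label_match(model_output: str, expected: str) -> bool:
    """Compare classification labels, ignoring case and surrounding whitespace."""
    return model_output.strip().lower() == expected.strip().lower()


def phone_match(model_output: str, expected: str) -> bool:
    """Compare phone numbers by digits only, so formatting differences pass."""
    def digits(s: str) -> str:
        return re.sub(r"\D", "", s)
    return digits(model_output) == digits(expected)
```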

Embedding similarity is the next tier — for outputs where two valid responses can phrase the same thing differently. Embed the model output and the expected output, take the cosine similarity, threshold at 0.85 or so. Works for "summarise this paragraph" or "answer this factual question" cases.

LLM-as-judge is the catch-all for everything that doesn't have an objective answer. The pattern is covered in the next section.
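Since one dataset usually mixes all three methods, each case needs to say how it should be scored. One possible way to wire that up is a small router. The `metadata.scoring` field, the function names, and the choice to default to the judge are all assumptions made for this sketch.

```python
from typing import Callable, Dict

# A scorer takes the model output and the case's expected value, returns 0.0-1.0.
Scorer = Callable[[str, object], float]


def exact_scorer(output: str, expected: object) -> float:
    """Simplest exact match: string equality after trimming whitespace."""
    return 1.0 if output.strip() == str(expected).strip() else 0.0


def make_router(scorers: Dict[str, Scorer]) -> Callable[[dict, str], float]:
    """Build a function that scores a case with the method named in its metadata."""
    def score_case(case: dict, output: str) -> float:
        # Cases without an explicit method fall back to the catch-all judge.
        method = case.get("metadata", {}).get("scoring", "judge")
        if method not in scorers:
            raise ValueError(f"unknown scoring method: {method}")
        return scorers[method](output, case["expected"])
    return score_case
```

In a real harness the `"judge"` and `"similarity"` entries would wrap the LLM and embedding calls; passing them in as plain functions keeps the router testable without any API access.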

LLM-as-judge: the workhorse for free-form outputs

The pattern: a second LLM call evaluates the first LLM's output against criteria you specify.

code

You are an evaluator. You will be given a user question, the assistant's answer,
and (optionally) a reference answer. Score the assistant's answer on:

1. Correctness: are the facts accurate? (0-5)
2. Completeness: does it address all parts of the question? (0-5)
3. Tone: appropriate for a customer-support context? (0-5)

Return JSON: { correctness: int, completeness: int, tone: int, comments: string }

The judge prompt is itself a prompt that needs care. Best practices:

Use a strong model. Sonnet or Opus, not Haiku — the judge needs to reason carefully.
Provide rubrics, not vague criteria. "5: fully correct with no errors. 3: partially correct. 1: misleading."
Run multiple judges if stakes are high. Three independent judges, take the median score. More expensive but more reliable.
Validate your judge. Pick 20 cases, have a human grade them, compare to the LLM judge's grade. If the correlation is below 0.7, the judge prompt needs work.
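The last two practices reduce to a little arithmetic. A minimal sketch, assuming scores on the 0-5 scale and assuming "correlation" means Pearson correlation (the text does not say which coefficient; a rank correlation would be an equally reasonable choice). The function names are illustrative.

```python
from math import sqrt
from statistics import mean, median


def median_score(judge_scores):
    """Combine independent judges' scores for one case by taking the median."""
    return median(judge_scores)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one grader is constant")
    return cov / sqrt(var_x * var_y)


def judge_is_trustworthy(human_scores, judge_scores, min_correlation=0.7):
    """True if the LLM judge tracks human grades closely enough to rely on."""
    return pearson(human_scores, judge_scores) >= min_correlation
```

For example, `median_score([3, 5, 4])` gives 4, so a single outlier judge cannot swing the case. With only 20 graded cases the correlation estimate is itself noisy, so treat a value near 0.7 as a prompt to grade more cases rather than as a verdict.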

The judge doesn't need to be the same model as the prompt under test — and ideally isn't. Architectural diversity reduces correlated failures. If you're prompt-testing on Claude, judge on GPT, or vice versa.

Embedding similarity for semantic match

The cheaper alternative to LLM-as-judge when responses should be "approximately the same":

python

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text):
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score(model_output, expected):
    return cosine(embed(model_output), embed(expected))

# Threshold at 0.85 for "close enough"
score("The capital is Paris.", "Paris is France's capital.")  # ~0.93 → pass
score("The capital is Paris.", "The capital is London.")      # ~0.78 → fail

The catch: embedding similarity is fooled by negation. "The patient is allergic to penicillin" and "The patient is not allergic to penicillin" have very similar embeddings but mean opposite things. Don't use embedding similarity for cases where negation matters; use LLM-as-judge.

Code: a runnable eval harness

A complete eval harness in about 40 lines:

javascript

import Anthropic from "@anthropic-ai/sdk";
import fs from "fs/promises";
const client = new Anthropic();

const dataset = JSON.parse(await fs.readFile("eval-dataset.json", "utf8"));
const SYSTEM_PROMPT = await fs.readFile("system-prompt.md", "utf8");

async function runModel(input) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    system: SYSTEM_PROMPT,
    messages: [{ role: "user", content: input }],
  });
  return response.content.find((b) => b.type === "text").text;
}

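async function judge(question, answer, expected) {
  // Ideally the judge is a different model from the one under test; the same one is used here for brevity.
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 256,
    messages: [{
      role: "user",
      content: `Question: ${question}\n\nReference answer: ${expected}\n\nAssistant answer: ${answer}\n\nScore the assistant answer on correctness (0-5) and completeness (0-5). Return only JSON: { "correctness": int, "completeness": int, "comments": string }`,
    }],
  });
  const text = response.content.find((b) => b.type === "text").text;
  // Judges sometimes wrap the JSON in prose or a code fence, so parse only the outermost {...}.
  return JSON.parse(text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1));
}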

const results = await Promise.all(dataset.map(async (case_) => {
  const output = await runModel(case_.input);
  const score = await judge(case_.input, output, case_.expected);
  return { ...case_, output, score };
}));

const avgCorrectness = results.reduce((s, r) => s + r.score.correctness, 0) / results.length;
const passing = results.filter((r) => r.score.correctness >= 4).length;

console.log(`Average correctness: ${avgCorrectness.toFixed(2)}/5`);
console.log(`Passing (≥4): ${passing}/${results.length}`);
process.exit(avgCorrectness < 4.0 ? 1 : 0);

Run with node eval.mjs. Exits 1 if the average drops below 4.0/5 — which is what makes it a CI gate.

Running evals in CI

Add to your GitHub Actions workflow:

yaml

name: LLM Eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/llm/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: node eval.mjs
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

This triggers on any PR that touches the prompts or the LLM code and fails the PR if the eval score drops below the threshold. The eval runs in 30-60 seconds for a 50-case dataset.

Set the threshold based on the baseline. If your current average is 4.3/5, set the gate at 4.0 (allows minor variance, blocks real regressions). Tune over time as you add cases.
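The baseline-minus-margin rule is simple enough to write down. A sketch in Python with hypothetical function names, where the 0.3 margin and the pass mark of 4 are taken from the numbers used in this article, not from any standard:

```python
def gate_threshold(baseline_avg: float, margin: float = 0.3) -> float:
    """Derive the CI gate from the current baseline average."""
    return round(baseline_avg - margin, 2)


def evaluate_gate(scores, baseline_avg, margin=0.3, pass_mark=4):
    """Summarise one eval run against the gate derived from the baseline."""
    threshold = gate_threshold(baseline_avg, margin)
    average = sum(scores) / len(scores)
    return {
        "threshold": threshold,
        "average": round(average, 2),
        "passing": sum(s >= pass_mark for s in scores),  # cases at or above the pass mark
        "ok": average >= threshold,  # False should fail the build
    }
```

With a baseline of 4.3, `gate_threshold(4.3)` gives 4.0, so a run averaging 4.2 passes and a run averaging 3.6 fails.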

Tracking quality over time

Beyond per-PR gating, log eval scores over time. Plot them in a dashboard:

Average score per model version
Per-case score history (regression-prone cases stand out)
Cost per eval run (to detect prompt bloat)
Failure-mode breakdown (which case categories are slipping)

This is where you catch slow drift — the kind where no single PR triggers the gate but the cumulative effect over 20 PRs takes you from 4.3 to 3.8. Trends in the dashboard tell that story.
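A dashboard needs a history to plot. One lightweight option is to append each run to a JSON Lines file and compare recent runs with early ones. The file layout, field names, and window size below are assumptions for this sketch. A metrics store or an eval platform would do the same job.

```python
import json
from pathlib import Path


def log_run(path, commit, avg_score):
    """Append one eval run to a JSON Lines history file."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps({"commit": commit, "avg_score": avg_score}) + "\n")


def load_history(path):
    """Read every logged run, oldest first."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def drift(history, window=5):
    """Mean of the latest `window` runs minus the mean of the first `window` runs.

    Returns None until there are at least 2 * window runs to compare.
    """
    scores = [run["avg_score"] for run in history]
    if len(scores) < 2 * window:
        return None
    early = sum(scores[:window]) / window
    recent = sum(scores[-window:]) / window
    return recent - early
```

A negative `drift` value that keeps growing while every individual PR still clears the gate is the slow decline described above.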

What to do next

For the prompt-engineering and model-selection decisions evals enable:

How to Write an Effective System Prompt — evals tell you which prompt is better; this covers how to write the candidate prompts.
How to Choose Between Claude Haiku, Sonnet, and Opus — evals are how you actually pick a tier on data.
How to Stop an LLM from Hallucinating — evals are how you measure whether your anti-hallucination techniques are working.

External references: Anthropic's evals documentation, OpenAI's evals library, and Promptfoo, a popular open-source eval framework.

FAQ

How many cases do I need in my eval dataset?

50-100 cases is a good starting point. Below 50, the eval signal is noisy and minor changes pass or fail randomly. Above 100, you're paying real money per eval run, so add cases deliberately.

Grow the dataset over time by adding every reported bug as a case. After a year of production use, a mature dataset is typically 200-500 cases covering the full traffic distribution.

Should I use the same model for the test and the judge?

Ideally no. Using different models for the prompt under test and the judge reduces correlated failures: if both hallucinate the same way, you miss the regression.

Common pattern: test on Sonnet, judge on Opus. Or test on Claude, judge on GPT. The judge needs to be at least as capable as the model being tested.

How much does an LLM eval cost to run?

For a 50-case eval with Claude Sonnet for both inference and judging, expect roughly $0.50-$2.00 per full run, depending on prompt length. For 100 cases with Opus judging, $5-10.

Run on every prompt-changing PR. At a few dozen PRs a month, this is a small line item against a typical LLM bill, and it catches regressions that would otherwise be expensive to debug post-deploy.

Can I use embedding similarity instead of LLM-as-judge?

For factual recall-style cases where two correct answers can be phrased differently, yes: embedding similarity is faster and cheaper. For cases where nuance, tone, or partial correctness matters, no: embeddings can't grade that.

The right answer is usually a mix. Use embedding similarity for the cases it handles well; reserve LLM-as-judge for the cases that need real reasoning about correctness.

What threshold should I set for the eval CI gate?

Start at baseline-minus-0.3 on a 5-point scale. If your current average is 4.3, gate at 4.0. Tight enough to catch real regressions, loose enough to allow normal eval variance.

Tune over time. If you see lots of false-positive failures (the eval blocks a PR that would have shipped fine), loosen. If you see real regressions slipping through, tighten.

Ishan Karunaratne