LLM evals are unit tests for prompts. They detect when a prompt edit, a model switch, or a context-retrieval change quietly breaks a 5% slice of your traffic — the kind of regression a human reviewer would miss but a 100-case test suite catches. In 2026 the standard stack is: a golden dataset of input-output pairs that represent your real traffic, a mix of scoring methods (exact match for structured outputs, LLM-as-judge for free-form, embedding similarity for semantic equivalence), and a CI integration that runs the eval on every prompt change and fails the build if quality drops below a threshold. I'll walk the full stack with code examples.
The reason this matters: LLM behaviour is fragile in ways traditional code isn't. Change one word in a system prompt and 10% of outputs suddenly look different. Switch models and the prompt that worked yesterday produces malformed JSON today. Without evals, you ship those regressions and find out from user complaints. With evals, you catch them on the PR.
Jump to:
- Building a golden dataset
- Scoring methods: which one when
- LLM-as-judge: the workhorse for free-form outputs
- Embedding similarity for semantic match
- Code: a runnable eval harness
- Running evals in CI
- Tracking quality over time
- FAQ
Building a golden dataset
The dataset is the heart of the eval. Without good cases, the scoring doesn't matter.
Start with 50-100 cases that represent your real traffic. Mix:
- Happy path: the common 60% of queries
- Edge cases: 20% — long inputs, short inputs, weird characters, multilingual
- Adversarial: 10% — prompt injection attempts, off-topic requests, requests the model should refuse
- Regression cases: 10% — every bug you've fixed, captured as a case. When a user reports the model gave a wrong answer, add the input to the dataset before fixing.
Each case is { input: ..., expected: ..., metadata: { tags, source, etc. } }. The expected field's format depends on the scoring method (a string for exact match, a rubric for LLM-as-judge, a sample acceptable response for similarity).
Store the dataset as a checked-in JSON file in your repo. Version it. When you add cases, commit them. Treat it as documentation of "what good looks like" for this prompt.
Scoring methods: which one when
Three primary methods, each fits different output shapes:
| Method | Best for | Speed | Reliability |
|---|---|---|---|
| Exact match | Structured outputs (JSON, classification labels, enums) | Instant | Perfect when applicable |
| LLM-as-judge | Free-form text, multi-criteria quality, "is this answer correct" | Slow (LLM call per case) | Good with the right judge prompt |
| Embedding similarity | "Is this semantically close to the expected answer?" | Fast | Moderate — fails on negations |
Exact match is your first choice when the output is structured. If the prompt produces JSON, hash the parsed object and compare to the expected. If it classifies into 5 buckets, compare the chosen bucket. If it extracts a phone number, normalise and compare.
Embedding similarity is the next tier — for outputs where two valid responses can phrase the same thing differently. Embed the model output and the expected output, take the cosine similarity, threshold at 0.85 or so. Works for "summarise this paragraph" or "answer this factual question" cases.
LLM-as-judge is the catch-all for everything that doesn't have an objective answer. Cover the workhorse pattern below.
LLM-as-judge: the workhorse for free-form outputs
The pattern: a second LLM call evaluates the first LLM's output against criteria you specify.
You are an evaluator. You will be given a user question, the assistant's answer,
and (optionally) a reference answer. Score the assistant's answer on:
1. Correctness: are the facts accurate? (0-5)
2. Completeness: does it address all parts of the question? (0-5)
3. Tone: appropriate for a customer-support context? (0-5)
Return JSON: { correctness: int, completeness: int, tone: int, comments: string }
The judge prompt is itself a prompt that needs care. Best practices:
- Use a strong model. Sonnet or Opus, not Haiku — the judge needs to reason carefully.
- Provide rubrics, not vague criteria. "5: fully correct with no errors. 3: partially correct. 1: misleading."
- Run multiple judges if stakes are high. Three independent judges, take the median score. More expensive but more reliable.
- Validate your judge. Pick 20 cases, have a human grade them, compare to the LLM judge's grade. If the correlation is below 0.7, the judge prompt needs work.
The judge doesn't need to be the same model as the prompt under test — and ideally isn't. Architectural diversity reduces correlated failures. If you're prompt-testing on Claude, judge on GPT, or vice versa.
Embedding similarity for semantic match
The cheaper alternative to LLM-as-judge when responses should be "approximately the same":
from openai import OpenAI
import numpy as np
client = OpenAI()
def embed(text):
return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
def cosine(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def score(model_output, expected):
return cosine(embed(model_output), embed(expected))
# Threshold at 0.85 for "close enough"
score("The capital is Paris.", "Paris is France's capital.") # ~0.93 → pass
score("The capital is Paris.", "The capital is London.") # ~0.78 → failThe catch: embedding similarity is fooled by negation. "The patient is allergic to penicillin" and "The patient is not allergic to penicillin" have very similar embeddings but mean opposite things. Don't use embedding similarity for cases where negation matters; use LLM-as-judge.
Code: a runnable eval harness
A complete eval harness in 60 lines:
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs/promises";
const client = new Anthropic();
const dataset = JSON.parse(await fs.readFile("eval-dataset.json", "utf8"));
const SYSTEM_PROMPT = await fs.readFile("system-prompt.md", "utf8");
async function runModel(input) {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 512,
system: SYSTEM_PROMPT,
messages: [{ role: "user", content: input }],
});
return response.content.find((b) => b.type === "text").text;
}
async function judge(question, answer, expected) {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 256,
messages: [{
role: "user",
content: `Question: ${question}\n\nReference answer: ${expected}\n\nAssistant answer: ${answer}\n\nScore the assistant answer on correctness (0-5) and completeness (0-5). Return JSON: { correctness, completeness, comments }`,
}],
});
return JSON.parse(response.content.find((b) => b.type === "text").text);
}
const results = await Promise.all(dataset.map(async (case_) => {
const output = await runModel(case_.input);
const score = await judge(case_.input, output, case_.expected);
return { ...case_, output, score };
}));
const avgCorrectness = results.reduce((s, r) => s + r.score.correctness, 0) / results.length;
const passing = results.filter((r) => r.score.correctness >= 4).length;
console.log(`Average correctness: ${avgCorrectness.toFixed(2)}/5`);
console.log(`Passing (≥4): ${passing}/${results.length}`);
process.exit(avgCorrectness < 4.0 ? 1 : 0);Run with node eval.mjs. Exits 1 if the average drops below 4.0/5 — which is what makes it a CI gate.
Running evals in CI
Add to your GitHub Actions workflow:
name: LLM Eval
on:
pull_request:
paths:
- "prompts/**"
- "src/llm/**"
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
- run: npm ci
- run: node eval.mjs
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}Triggers on any PR that touches the prompt or LLM code. Fails the PR if the eval threshold drops. The eval runs in 30-60 seconds for a 50-case dataset.
Set the threshold based on the baseline. If your current average is 4.3/5, set the gate at 4.0 (allows minor variance, blocks real regressions). Tune over time as you add cases.
Tracking quality over time
Beyond per-PR gating, log eval scores over time. Plot them in a dashboard:
- Average score per model version
- Per-case score history (regression-prone cases stand out)
- Cost per eval run (to detect prompt bloat)
- Failure-mode breakdown (which case categories are slipping)
This is where you catch slow drift — the kind where no single PR triggers the gate but the cumulative effect over 20 PRs takes you from 4.3 to 3.8. Trends in the dashboard tell that story.
What to do next
For the prompt-engineering and model-selection decisions evals enable:
- How to Write an Effective System Prompt — evals tell you which prompt is better; this covers how to write the candidate prompts.
- How to Choose Between Claude Haiku, Sonnet, and Opus — evals are how you actually pick a tier on data.
- How to Stop an LLM from Hallucinating — evals are how you measure whether your anti-hallucination techniques are working.
External references: Anthropic's evals documentation, OpenAI's evals library, Promptfoo is a popular open-source eval framework.





