The default pick for production AI work in 2026 is Claude Sonnet 4.6: it handles roughly 90% of real tasks well, costs about three-fifths of what Opus does, and is noticeably faster on the same prompt. Haiku 4.5 is the right pick when you have high volume and forgiving accuracy requirements (classification, extraction, summarisation of short text). Opus 4.7 is the right pick when you need the best possible reasoning: multi-step planning, complex code refactors, anything where Sonnet has been visibly struggling. I'll walk through the cost math, the latency profile, the capability gaps, and a concrete decision matrix for routing prompts across the three tiers.
The wrong question is "which model is the best." The right question is "what's the cheapest tier that gets this specific task done well enough." A production AI app that runs everything on Opus pays roughly 1.7× a pure-Sonnet bill (and far more than a routed one) for a quality lift you probably can't measure on most tasks. An app that runs everything on Haiku makes mistakes the user notices. The win is routing.
Jump to:
- The current pricing (mid-2026)
- Latency profile per tier
- Capability gaps: where each tier wins
- The default: start with Sonnet
- When to drop to Haiku
- When to step up to Opus
- Decision matrix
- Cost math: routing across tiers
- FAQ
The current pricing (mid-2026)
| Model | Input per million | Output per million | Cache hit (10%) | Notes |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | Fastest, cheapest, decent reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | The workhorse — default pick |
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 | Best reasoning, slowest, most expensive |
Cache writes cost 1.25× input for the 5-minute TTL, 2× for the 1-hour TTL. Prompt caching is supported on all current Claude API models — covered in How to Cut LLM API Costs with Prompt Caching.
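To make the cache multipliers concrete, here is a minimal sketch of the arithmetic. It assumes an idealised pattern (one cache write, then every later call is a hit inside the TTL), and the dictionary keys are just labels for this example, not API model IDs.

```python
from typing import Optional

# USD per million input tokens, copied from the pricing table above.
INPUT_PRICE = {"haiku-4.5": 1.00, "sonnet-4.6": 3.00, "opus-4.7": 5.00}

WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # cache-write surcharge by TTL
READ_MULTIPLIER = 0.10                      # cache hits bill at 10% of input


def prefix_cost(model: str, prefix_tokens: int, calls: int,
                ttl: Optional[str] = "5m") -> float:
    """Cost in USD of sending the same prefix `calls` times.

    ttl=None means no caching. Otherwise this assumes one cache write
    followed by (calls - 1) cache hits, which is the best case.
    """
    per_token = INPUT_PRICE[model] / 1_000_000
    if ttl is None:
        return prefix_tokens * per_token * calls
    write = prefix_tokens * per_token * WRITE_MULTIPLIER[ttl]
    reads = prefix_tokens * per_token * READ_MULTIPLIER * (calls - 1)
    return write + reads


uncached = prefix_cost("sonnet-4.6", 10_000, 100, ttl=None)
cached = prefix_cost("sonnet-4.6", 10_000, 100, ttl="5m")
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the 5-minute cache already pays for itself on the second call, because one write plus one hit (1.25 + 0.10) costs less than two full-price reads.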
For generation-heavy workloads, output cost is where the budget goes — a typical content-generation task has a 5:1 to 10:1 output-to-input token ratio (lower for classification and extraction, where input dominates). Opus output costs $25 per million tokens versus Sonnet's $15, about 1.7×. Blanket Opus usage is rarely the right call.
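A quick way to see where the budget goes is to compute the output share of a call's cost. This is a small sketch using the Sonnet prices from the table; the token counts are illustrative assumptions, not measurements.

```python
def output_share(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Fraction of a call's cost that comes from output tokens.

    Prices are USD per million tokens; the per-million factor cancels out.
    """
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return output_cost / (input_cost + output_cost)


# Sonnet 4.6 prices: $3 input, $15 output per million tokens.
generation = output_share(1_000, 5_000, 3.00, 15.00)       # 5:1 output-to-input
classification = output_share(10_000, 1_000, 3.00, 15.00)  # input-heavy
print(f"generation: {generation:.0%}, classification: {classification:.0%}")
```

At a 5:1 output-to-input ratio, output is about 96% of the bill; for an input-heavy call it drops to about a third.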
Latency profile per tier
Time-to-first-token (TTFT) and tokens-per-second (TPS) vary by region, traffic, prompt size, output length, and service tier — Anthropic does not publish official figures. The numbers below are approximate, observed at typical request sizes. Treat the ranking as solid and the exact values as ballpark:
| Model | TTFT | TPS (output) | Typical 500-token response |
|---|---|---|---|
| Haiku 4.5 | ~250ms | ~140 | ~3.8 sec |
| Sonnet 4.6 | ~400ms | ~75 | ~7 sec |
| Opus 4.7 | ~700ms | ~45 | ~12 sec |
For interactive UX where the user is watching tokens stream, Haiku feels instant, Sonnet feels responsive, Opus feels thoughtful (read: slow). For background workflows where latency isn't user-visible, this doesn't matter. For real-time chat, it matters a lot.
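The last column of the table is just TTFT plus output tokens divided by TPS. A rough sketch, using the approximate figures above (which are observations, not published numbers):

```python
# (TTFT in seconds, output tokens per second), taken from the table above.
LATENCY = {
    "haiku-4.5": (0.25, 140),
    "sonnet-4.6": (0.40, 75),
    "opus-4.7": (0.70, 45),
}


def estimated_seconds(model: str, output_tokens: int) -> float:
    """Rough wall-clock time for a streamed response of a given length."""
    ttft, tps = LATENCY[model]
    return ttft + output_tokens / tps


for model in LATENCY:
    print(f"{model}: ~{estimated_seconds(model, 500):.1f}s for 500 tokens")
```

Swap in your own measured TTFT and TPS before using this for capacity planning; the ranking should hold, the exact values probably won't.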
Capability gaps: where each tier wins
After a year of running every kind of prompt across all three tiers in production, here is the practical capability map:
Haiku 4.5 is reliably good at:
- Classification (sentiment, topic, intent)
- Field extraction from structured documents
- Short summaries (paragraph → 1-2 sentences)
- Yes/no questions with a reasonably worded context
- Simple translation
- Code completion within a single function
Haiku 4.5 visibly struggles with:
- Multi-step reasoning (more than 2-3 steps)
- Long-context synthesis (>10 documents)
- Hard refactors across many files
- Anything that needs "thinking carefully"
Sonnet 4.6 is the workhorse:
- All of Haiku's strengths, plus
- Multi-step reasoning up to ~7 steps
- Long-context analysis well past 200K tokens — Sonnet 4.6 ships a 1M-token window
- Code generation that runs on the first or second try
- Most agentic workflows
- Most chatbot use cases
Sonnet 4.6 visibly struggles with:
- Deeply nested logical reasoning ("if A then B, but only if not C, unless D, in which case E")
- Novel mathematical proofs
- Very long-horizon planning (15+ step plans)
- Code refactors that touch architectural concerns
Opus 4.7 wins on:
- The hard reasoning tasks above
- Deep code refactors with cross-cutting concerns
- Plans that need to be right because each step is expensive
- Novel problem-solving where there is no obvious template
The default: start with Sonnet
For any new endpoint, prompt, or pipeline: start with Sonnet 4.6. Measure quality on a representative sample. Then ask:
- Is the quality acceptable? → keep Sonnet.
- Is the quality high-but-overkill (latency or cost matter more)? → try Haiku.
- Is the quality not good enough? → try Opus.
You'll find that for ~70% of endpoints Sonnet is the right answer. You'll move 20% down to Haiku for cost reasons, and 10% up to Opus for quality reasons. That distribution is the win.
Don't pick the tier based on intuition; measure. Run the same 50 prompts through each tier, score the outputs against ground truth or with an LLM-as-judge, and pick on data. Building an eval suite for this is covered in How to Write LLM Evals That Catch Regressions.
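As a sketch of what "pick on data" can look like, the harness below scores each tier on the same cases and returns the cheapest one that clears a quality bar. Everything here is hypothetical scaffolding: `generators` maps a tier name to whatever function calls that model in your stack, `grade` is your scorer (exact match, LLM-as-judge, and so on), and the 0.9 threshold is an assumption you should set per task.

```python
from typing import Callable, Dict, List, Optional, Tuple

TIERS_CHEAPEST_FIRST = ["haiku", "sonnet", "opus"]

Case = Tuple[str, str]  # (prompt, expected answer)


def score_tier(generate: Callable[[str], str], cases: List[Case],
               grade: Callable[[str, str], float]) -> float:
    """Mean score of one tier over all eval cases."""
    scores = [grade(generate(prompt), expected) for prompt, expected in cases]
    return sum(scores) / len(scores)


def cheapest_passing_tier(generators: Dict[str, Callable[[str], str]],
                          cases: List[Case],
                          grade: Callable[[str, str], float],
                          threshold: float = 0.9) -> Optional[str]:
    """Return the cheapest tier whose mean score meets the threshold."""
    for tier in TIERS_CHEAPEST_FIRST:
        if score_tier(generators[tier], cases, grade) >= threshold:
            return tier
    return None  # no tier is good enough; fix the prompt or the task


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0
```

In practice you would also log the per-tier scores rather than only the winner, so you can see how close the cheaper tier came.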
When to drop to Haiku
Drop to Haiku when:
- The task is structural classification. "Is this a refund request, a billing question, or a feature request?" is the kind of prompt Haiku handles reliably, at a third of Sonnet's cost.
- You're extracting fields from a structured source. "Pull the order ID and the total from this email" — Haiku reads this fine.
- You have high volume. Anything running thousands of times an hour is worth a Haiku eval just to see if it survives the downgrade. A two-thirds cost cut at high volume is real money.
- Latency is user-facing. Streaming UX where the user is watching tokens feels noticeably better on Haiku.
Don't drop to Haiku when:
- The task involves multi-step reasoning.
- The task involves long context (>50K tokens with detail to track).
- The task is novel — Haiku is great at tasks it has seen, weaker at unusual ones.
When to step up to Opus
Step up to Opus when:
- Sonnet is producing wrong answers on the same prompt repeatedly. Not "slightly off" — visibly wrong. Run the same prompt 5 times. If 3+ outputs are wrong, Sonnet is at its limit.
- The cost of a wrong answer is high. Financial decisions, medical summaries, legal drafting — anywhere a mistake is expensive to fix.
- The task requires architectural thinking. Multi-file code refactors where the model needs to understand cross-cutting concerns.
- The plan is long and each step is expensive. Agentic workflows where running the wrong tool costs you minutes or dollars per step. The cost of Opus is small compared to the cost of running 12 bad tool calls.
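The "run it 5 times, escalate on 3+ wrong" check from the first bullet can be sketched as below. `generate` and `is_wrong` are hypothetical callables you supply: one calls Sonnet in your stack, the other decides whether a given output is visibly wrong.

```python
from typing import Callable


def sonnet_at_limit(generate: Callable[[str], str],
                    is_wrong: Callable[[str], bool],
                    prompt: str,
                    runs: int = 5,
                    wrong_needed: int = 3) -> bool:
    """Run the same prompt `runs` times; True if enough outputs are wrong."""
    wrong = sum(1 for _ in range(runs) if is_wrong(generate(prompt)))
    return wrong >= wrong_needed
```

The defaults mirror the rule of thumb above; for a high-stakes task you might lower `wrong_needed` so that even one or two failures trigger an Opus eval.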
Don't step up to Opus when:
- The task is high-volume and quality is "good enough" on Sonnet. The Opus cost adds up fast.
- The task is latency-sensitive. Opus is noticeably slower.
Decision matrix
| Task pattern | Recommended | Why |
|---|---|---|
| Classify support tickets | Haiku | Structural, high volume, forgiving |
| Extract fields from invoices | Haiku | Structured, repetitive |
| Summarise a long PDF | Sonnet | Long context, reasoning-light |
| Chat with a knowledge base (RAG) | Sonnet | The default; its speed advantage over Opus matters in chat |
| Generate marketing copy | Sonnet | Quality matters but Opus is overkill |
| Write a 5-file refactor | Sonnet first, Opus if it fails | Try the cheap one first |
| Plan a multi-step agent | Opus | Plans are expensive to redo |
| Critique another LLM's output | Opus | LLM-as-judge benefits from the strongest reasoner |
| Translate a sentence | Haiku | Simple, fast |
| Decide if user input contains PII | Haiku | Yes/no classification |
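If you already know the task pattern at call time, the matrix can live in code as a plain lookup. This is one possible encoding, with made-up task keys; each entry lists the tiers to try in order, so "Sonnet first, Opus if it fails" becomes a two-element tuple.

```python
from typing import Dict, Tuple

# Task keys are hypothetical names for the rows of the matrix above.
ROUTES: Dict[str, Tuple[str, ...]] = {
    "classify_support_tickets": ("haiku",),
    "extract_invoice_fields": ("haiku",),
    "summarise_long_pdf": ("sonnet",),
    "rag_chat": ("sonnet",),
    "marketing_copy": ("sonnet",),
    "five_file_refactor": ("sonnet", "opus"),  # escalate only on failure
    "plan_multi_step_agent": ("opus",),
    "critique_llm_output": ("opus",),
    "translate_sentence": ("haiku",),
    "detect_pii": ("haiku",),
}

DEFAULT_ROUTE: Tuple[str, ...] = ("sonnet",)  # unknown tasks start on Sonnet


def tiers_for(task: str) -> Tuple[str, ...]:
    """Tiers to try, in order, for a known task pattern."""
    return ROUTES.get(task, DEFAULT_ROUTE)
```

Defaulting unknown tasks to Sonnet follows the "start with Sonnet" rule; the table should only change when an eval says so.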
Cost math: routing across tiers
Concrete example. A customer-support AI handles 100,000 queries a day. Assume each call is about 1K input tokens and 500 output tokens. The per-call cost on each tier:
- Haiku 4.5 — (1,000 × $1 + 500 × $5) / 1,000,000 = $0.0035 per call
- Sonnet 4.6 — (1,000 × $3 + 500 × $15) / 1,000,000 = $0.0105 per call
- Opus 4.7 — (1,000 × $5 + 500 × $25) / 1,000,000 = $0.0175 per call
Without routing, every query goes to Sonnet:
100,000 calls × $0.0105 = $1,050/day.
With routing — Haiku handles 60% (simple lookups, classification), Sonnet handles 35% (real questions), Opus handles 5% (escalations that need harder reasoning):
- 60,000 × $0.0035 (Haiku) = $210.00
- 35,000 × $0.0105 (Sonnet) = $367.50
- 5,000 × $0.0175 (Opus) = $87.50
Total: $665/day. That's a $385/day saving (~37%) just from routing, with arguably better outcomes on the Opus-routed escalations.
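The same arithmetic as a small script, so you can plug in your own volumes and traffic mix. The prices and the 1K-input / 500-output call shape come from this article; the function and variable names are just for illustration.

```python
from typing import Dict

# (input, output) USD per million tokens, from the pricing table.
PRICES = {"haiku": (1.00, 5.00), "sonnet": (3.00, 15.00), "opus": (5.00, 25.00)}


def cost_per_call(tier: str, input_tokens: int = 1_000,
                  output_tokens: int = 500) -> float:
    """USD cost of one call at the given token counts."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def daily_cost(calls: int, mix: Dict[str, float]) -> float:
    """USD per day when `mix` gives each tier's share of traffic (sums to 1)."""
    return sum(calls * share * cost_per_call(tier) for tier, share in mix.items())


baseline = daily_cost(100_000, {"sonnet": 1.0})
routed = daily_cost(100_000, {"haiku": 0.60, "sonnet": 0.35, "opus": 0.05})
saving = baseline - routed
print(f"baseline ${baseline:,.2f}, routed ${routed:,.2f}, "
      f"saving ${saving:,.2f} ({saving / baseline:.0%})")
```

The result is only as good as the traffic mix you feed it, so measure the split on real queries before trusting the projected saving.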
The routing logic itself can be a Haiku classifier ("which tier should handle this prompt?"). A short classification call costs about $0.0004, so running it on all 100,000 calls adds roughly $40/day. That takes the net saving from $385/day to about $345/day (~33%), which is still well worth the extra call.
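A minimal sketch of that router, kept independent of any SDK: `classify` is a hypothetical callable that sends a prompt to your Haiku endpoint and returns the raw text. The prompt wording and the Sonnet fallback are my assumptions, not a tested recipe.

```python
from typing import Callable

VALID_TIERS = {"haiku", "sonnet", "opus"}
FALLBACK_TIER = "sonnet"  # assumption: fall back to the default tier

ROUTER_PROMPT = (
    "Reply with exactly one word: haiku, sonnet, or opus.\n"
    "haiku: simple lookups and classification.\n"
    "sonnet: real questions.\n"
    "opus: escalations that need harder reasoning.\n\n"
    "Query: {query}"
)


def pick_tier(query: str, classify: Callable[[str], str]) -> str:
    """Ask the classifier for a tier; use the fallback on anything unexpected."""
    label = classify(ROUTER_PROMPT.format(query=query)).strip().lower()
    return label if label in VALID_TIERS else FALLBACK_TIER
```

Validating the label matters: a classifier that answers with a sentence instead of a single word should degrade to the default tier, not crash the request.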
What to do next
For the cost-optimisation companion technique that stacks with model routing:
- How to Cut LLM API Costs with Prompt Caching: once you've picked the right tier, caching the stable prefix is the next lever, since cache hits bill at 10% of the normal input price.
For the evaluation infrastructure you need to actually pick a tier on data instead of vibes:
- How to Write LLM Evals That Catch Regressions covers the test-suite pattern for measuring quality differences across tiers.
External reference: the Anthropic model documentation is the canonical source for current capabilities, context windows, and pricing.
FAQ
See also
- How to Build RAG with Embeddings and Vector Search: the retrieval pattern that lets you use a cheaper Claude model (Haiku) without sacrificing answer quality
- How to Build an LLM Agent with Tool Use: when the workload is tool calls in a loop, the model choice math changes
- How to Cut LLM API Costs with Prompt Caching: the cost picture for each Claude model shifts dramatically once prompt caching is in play
- How to Run a Local LLM with Ollama: when none of Haiku, Sonnet, or Opus fit your privacy or cost budget, a local model is the fallback
- How to Get Reliable JSON from an LLM: how the choice of model interacts with structured-output reliability
Sources
Authoritative references this article was fact-checked against.
- Claude model pricing (docs), platform.claude.com
- Claude Sonnet 4.6 release (Anthropic news), anthropic.com





