The default pick for production AI work in 2026 is Claude Sonnet 4.6 — it handles 90% of real tasks well, costs about a fifth of Opus, and runs about three times faster on the same prompt. Haiku 4.5 is the right pick when you have high volume and forgiving accuracy requirements (classification, extraction, summarisation of short text). Opus 4.7 is the right pick when you need the best possible reasoning — multi-step planning, complex code refactors, anything where Sonnet has been visibly struggling. I'll walk the cost math, the latency profile, the capability gaps, and a concrete decision matrix for routing prompts across the three tiers.
The wrong question is "which model is the best." The right question is "what's the cheapest tier that gets this specific task done well enough." A production AI app that runs everything on Opus burns 5× the budget for a quality lift you probably can't measure. An app that runs everything on Haiku makes mistakes the user notices. The win is routing.
Jump to:
- The current pricing (mid-2026)
- Latency profile per tier
- Capability gaps: where each tier wins
- The default: start with Sonnet
- When to drop to Haiku
- When to step up to Opus
- Decision matrix
- Cost math: routing across tiers
- FAQ
The current pricing (mid-2026)
| Model | Input per million | Output per million | Cache hit (10%) | Notes |
|---|---|---|---|---|
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 | Fastest, cheapest, decent reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | The workhorse — default pick |
| Claude Opus 4.7 | $15.00 | $75.00 | $1.50 | Best reasoning, slowest, most expensive |
Cache writes cost 1.25× input. Prompt caching is universally supported — covered in How to Cut LLM API Costs with Prompt Caching.
The output cost is where the budget goes. A typical workflow has a 5:1 to 10:1 output-to-input token ratio for content-generation tasks, lower for classification, higher for code generation. Opus output is 5× the cost of Sonnet output. That's why blanket Opus usage is rarely the right call.
Latency profile per tier
Time-to-first-token (TTFT) and tokens-per-second (TPS) at typical request sizes:
| Model | TTFT | TPS (output) | Typical 500-token response |
|---|---|---|---|
| Haiku 4.5 | ~250ms | ~140 | ~3.8 sec |
| Sonnet 4.6 | ~400ms | ~75 | ~7 sec |
| Opus 4.7 | ~700ms | ~45 | ~12 sec |
For interactive UX where the user is watching tokens stream, Haiku feels instant, Sonnet feels responsive, Opus feels thoughtful (read: slow). For background workflows where latency isn't user-visible, this doesn't matter. For real-time chat, it matters a lot.
Capability gaps: where each tier wins
After a year of running every kind of prompt across all three tiers in production, the practical capability map:
Haiku 4.5 is reliably good at:
- Classification (sentiment, topic, intent)
- Field extraction from structured documents
- Short summaries (paragraph → 1-2 sentences)
- Yes/no questions with a reasonably worded context
- Simple translation
- Code completion within a single function
Haiku 4.5 visibly struggles with:
- Multi-step reasoning (more than 2-3 steps)
- Long-context synthesis (>10 documents)
- Hard refactors across many files
- Anything that needs "thinking carefully"
Sonnet 4.6 is the workhorse:
- All of Haiku's strengths, plus
- Multi-step reasoning up to ~7 steps
- Long-context analysis up to ~200K tokens reliably
- Code generation that runs on the first or second try
- Most agentic workflows
- Most chatbot use cases
Sonnet 4.6 visibly struggles with:
- Deeply nested logical reasoning ("if A then B, but only if not C, unless D, in which case E")
- Novel mathematical proofs
- Very long-horizon planning (15+ step plans)
- Code refactors that touch architectural concerns
Opus 4.7 wins on:
- The hard reasoning tasks above
- Deep code refactors with cross-cutting concerns
- Plans that need to be right because each step is expensive
- Novel problem-solving where there is no obvious template
The default: start with Sonnet
For any new endpoint, prompt, or pipeline: start with Sonnet 4.6. Measure quality on a representative sample. Then ask:
- Is the quality acceptable? → keep Sonnet.
- Is the quality high-but-overkill (latency or cost matter more)? → try Haiku.
- Is the quality not good enough? → try Opus.
You'll find that for ~70% of endpoints Sonnet is the right answer. You'll move 20% down to Haiku for cost reasons, and 10% up to Opus for quality reasons. That distribution is the win.
Don't pick the tier based on intuition — measure. Run the same 50 prompts through each tier, score the outputs against a ground-truth or LLM-as-judge, and pick on data. Building an eval suite for this is covered in How to Write LLM Evals That Catch Regressions.
When to drop to Haiku
Drop to Haiku when:
- The task is structural classification. "Is this a refund request, a billing question, or a feature request?" — Haiku handles this perfectly at 1/4 the cost.
- You're extracting fields from a structured source. "Pull the order ID and the total from this email" — Haiku reads this fine.
- You have high volume. Anything running thousands of times an hour is worth a Haiku eval just to see if it survives the downgrade. A 75% cost cut at high volume is real money.
- Latency is user-facing. Streaming UX where the user is watching tokens feels noticeably better on Haiku.
Don't drop to Haiku when:
- The task involves multi-step reasoning.
- The task involves long context (>50K tokens with detail to track).
- The task is novel — Haiku is great at tasks it has seen, weaker at unusual ones.
When to step up to Opus
Step up to Opus when:
- Sonnet is producing wrong answers on the same prompt repeatedly. Not "slightly off" — visibly wrong. Run the same prompt 5 times. If 3+ outputs are wrong, Sonnet is at its limit.
- The cost of a wrong answer is high. Financial decisions, medical summaries, legal drafting — anywhere a mistake is expensive to fix.
- The task requires architectural thinking. Multi-file code refactors where the model needs to understand cross-cutting concerns.
- The plan is long and each step is expensive. Agentic workflows where running the wrong tool costs you minutes or dollars per step. The cost of Opus is small compared to the cost of running 12 bad tool calls.
Don't step up to Opus when:
- The task is high-volume and quality is "good enough" on Sonnet. The Opus cost adds up fast.
- The task is latency-sensitive. Opus is noticeably slower.
Decision matrix
| Task pattern | Recommended | Why |
|---|---|---|
| Classify support tickets | Haiku | Structural, high volume, forgiving |
| Extract fields from invoices | Haiku | Structured, repetitive |
| Summarise a long PDF | Sonnet | Long context, reasoning-light |
| Chat with a knowledge base (RAG) | Sonnet | The default; faster than Opus matters in chat |
| Generate marketing copy | Sonnet | Quality matters but Opus is overkill |
| Write a 5-file refactor | Sonnet first, Opus if it fails | Try the cheap one first |
| Plan a multi-step agent | Opus | Plans are expensive to redo |
| Critique another LLM's output | Opus | LLM-as-judge benefits from the strongest reasoner |
| Translate a sentence | Haiku | Simple, fast |
| Decide if user input contains PII | Haiku | Yes/no classification |
Cost math: routing across tiers
Concrete example. A customer-support AI handles 100K queries a day. Without routing, all queries go to Sonnet:
100,000 calls × $0.04 per call (1K input + 500 output) = $4,000/day.
With routing — Haiku handles 60% (simple lookups, classification), Sonnet handles 35% (real questions), Opus handles 5% (escalations needing deep context):
- 60,000 × $0.012 (Haiku) = $720
- 35,000 × $0.04 (Sonnet) = $1,400
- 5,000 × $0.18 (Opus) = $900
Total: $3,020/day. That's a $980/day saving (~25%) just from routing, with arguably better outcomes on the Opus-routed escalations.
The routing logic itself can be a Haiku classifier ("which tier should handle this prompt?") — adds about $0.0008 per call to compute the route, dramatically outweighed by the routing savings on the rest of the call.
What to do next
For the cost-optimisation companion technique that stacks with model routing:
- How to Cut LLM API Costs with Prompt Caching — once you've picked the right tier, caching the stable prefix is the next 90% off.
For the evaluation infrastructure you need to actually pick a tier on data instead of vibes:
- How to Write LLM Evals That Catch Regressions covers the test-suite pattern for measuring quality differences across tiers.
External reference: the Anthropic model documentation is the canonical source for current capabilities, context windows, and pricing.





