How to Cut Your AI API Bill in 2026 (Every Lever That Works)

Most large AI API bills are not large because the models are expensive. They are large because the same workload is being run at the most expensive possible settings: the top-tier model when a cheaper one would do, full price when half price was available, the same instructions paid for on every call instead of cached once. There are maybe six levers that actually move an LLM bill, and they stack. Pull them in the right order and a job that cost hundreds of dollars often drops to tens: the worked example below takes a 20,000-record job from about $450 to about $30, with no change in output quality.

This is the overview. Each lever below links to a focused guide where it earns its own deep-dive.

Prices and model names here are current as of June 2026. LLM pricing changes often, so treat the numbers as ratios to reason with, not quotes to budget against, and check the provider pricing pages linked in Sources before you commit a real job.

Jump to:

The levers, ranked
Lever 1: right-size the model
Lever 2: cut the tokens, both directions
Lever 3: cache the prefix
Lever 4: batch the non-urgent work
Lever 5: skip the tool fees you do not need
Lever 6: trim long-running agents
Stacking them: one worked example
FAQ

The levers, ranked

Apply them in roughly this order. The early ones are bigger and cheaper to implement than the late ones.

Lever	Typical saving	Effort	Deep dive
Right-size the model	3x to 5x	low	Haiku vs Sonnet vs Opus
Cut output tokens	varies, often large	low	this page
Cut input tokens	varies	medium	this page
Prompt caching	up to 90% on the shared prefix	low to medium	Prompt caching
Batch processing	50% on tokens	low	LLM batch API
Skip unnecessary tool fees	removes a line item	low	this page

The numbers multiply. A 4x model saving and a 50% batch saving and a 90% prefix saving do not add up, they compound.

Lever 1: right-size the model

This is almost always the biggest lever, and the one people reach for last. The price gap between tiers is large: on Claude, Opus 4.8 is $5 per million input tokens and $25 output, Sonnet 4.6 is $3 and $15, and Haiku 4.5 is $1 and $5. That is a 5x spread on input between the cheapest and most expensive tier.

The mistake is defaulting every workload to the smartest model "to be safe." Most high-volume jobs (classification, tagging, extraction, format normalization) run fine on the cheapest tier. The discipline is to send each task to the lowest tier that actually passes it: start on Haiku, sample the output, and only move up when the cheaper model genuinely fails. The full trade-off, with where each tier wins, is in how to choose between Claude Haiku, Sonnet, and Opus.

Lever 2: cut the tokens, both directions

You pay per token in and per token out, so the cheapest token is the one you never send.

Output tokens are billed at 3x to 5x the input rate, so they are the higher-value target. Cap max_tokens to what the task needs instead of leaving it at a generous default, and ask for terse output explicitly. If you only need a category label, a JSON object, or a yes/no, say so. A prompt that returns a three-paragraph explanation when you needed one word is paying 4x output rate for 99% waste.

Input tokens add up when you stuff context "just in case." Retrieve the few documents that matter instead of pasting the whole knowledge base. Trim few-shot examples to the ones that change behavior. If the same large context repeats across calls, that is a caching problem, not a trimming problem (see Lever 3). One note for 2026: the newer Opus and Gemini tokenizers can count up to ~35% more tokens than older models for the same text, so re-measure with a token counter before assuming an old budget still holds.

Lever 3: cache the prefix

If the same system prompt, tool definitions, or RAG context goes out on call after call, you are paying full input price for identical bytes every time. Prompt caching charges a one-time write (on Claude, 1.25x input for the 5-minute cache or 2x for the 1-hour) and then about 10% of input for every subsequent read, up to a 90% saving on the cached portion. Anthropic, OpenAI, and Gemini all support it, with different mechanics.

The structural rule is simple: stable content first, dynamic content last, because the cache works on a contiguous prefix. A system prompt that ends with Current time: ${now} is uncacheable. The full mechanics (breakpoint placement, TTL trade-offs, the silent invalidators that quietly kill your hit rate, and how to verify caching is actually firing) are in how to cut LLM API costs with prompt caching.

Lever 4: batch the non-urgent work

Every major provider runs a batch API at 50% of the synchronous price. If no human is waiting for any single response (the output goes to a database, a spreadsheet, a report, or a queue), you are leaving half the money on the table by calling the synchronous endpoint in a loop.

The key distinction is that a true batch is one async job carrying many requests, not a loop that chunks work into groups of 100 but still fires individual full-price calls. And the batch discount stacks with caching, so the two levers multiply on any high-volume shared-prefix job. The full comparison across Claude, OpenAI, and Gemini, plus the gotcha that web-search fees are not batched, is in the LLM batch API guide. Two applied walk-throughs build on it: processing WordPress posts with AI at half the cost and bulk image generation with Gemini Nano Banana.

Lever 5: skip the tool fees you do not need

Server-side tools carry usage fees on top of tokens, and the batch discount does not touch them. On Claude, web search is $10 per 1,000 searches. Web fetch, by contrast, is free beyond the tokens for the content it pulls in. So if your job already has a URL and just needs the page content, fetch it rather than running a search to rediscover it. Code execution is free when paired with web search or fetch, and otherwise billed by container-hour after a monthly free allowance. The habit is to know which steps in your pipeline carry a per-call fee and design them out where the answer was already in hand.

Lever 6: trim long-running agents

For agentic workloads that run many turns, the cost is dominated by re-sending a growing history every turn. Context editing (clearing stale tool results) and compaction (summarizing old turns) keep that history from ballooning, and tuning the effort or thinking budget keeps the model from over-reasoning on simple steps. These are more involved than the levers above and matter mainly once you are running real agents rather than one-shot calls, so treat them as the advanced tier to reach for after the first five are in place.

Stacking them: one worked example

Take the recurring example: enrich 20,000 records, about 2,000 input and 500 output tokens each, where 1,200 of the input tokens are a shared instruction prefix.

Naive: Opus, synchronous, no cache. 20,000 × (2,000 × $5/M + 500 × $25/M) = $200 input + $250 output = $450.
Right-size to Sonnet. Same shape at $3 / $15 = $120 + $150 = $270. (Lever 1 alone, 40% off.)
Right-size to Haiku where the task allows it, at $1 / $5 = $40 + $50 = $90.
Batch it (50%). $45. (Lever 4.)
Cache the 1,200-token prefix. The shared input drops to roughly 10% of its cost on hits, pushing the total toward the $25 to $35 range depending on hit rate. (Lever 3.)

That is $450 down to about $30, an order of magnitude, with the bulk of it coming from picking the right model rather than any single clever trick. Batching and caching are real, but they are the polish on top of the model choice, not a substitute for it.

FAQ

Right-size the model. It is usually the largest single saving (a 3x to 5x price spread between tiers) and the lowest effort, since it is a one-line change. Default everything to the cheapest tier that passes the task, then move up only where the cheaper model genuinely fails.

Yes. A cached-token read costs about 10% of base input, and inside a batch that read still gets the 50% batch discount on top. The first cache write costs more (1.25x to 2x input on Claude), but you pay it once; after that the savings compound rather than add, which is why a high-volume job with a large shared prefix is the best case for both levers at once.

Not if you cut the right tokens. Capping max_tokens to what the task needs and asking for terse output removes waste, not substance. Trimming retrieved context to the documents that actually matter usually improves quality, because the model is not distracted by irrelevant material. The quality risk is in over-trimming few-shot examples or instructions the model relies on, so cut, measure, and keep what changes behavior.

Sometimes, but it is usually the smallest lever and the most disruptive to pull. The price spread between equivalent tiers across Anthropic, OpenAI, and Google is far narrower than the spread between tiers within one provider, and switching providers means re-testing quality and re-tuning prompts. Right-sizing, caching, and batching within your current provider almost always beat a provider swap, and they do not risk a quality regression.

How to Cut Your AI API Bill in 2026 (Every Lever That Works)

The levers, ranked

Lever 1: right-size the model

Lever 2: cut the tokens, both directions

Lever 3: cache the prefix

Lever 4: batch the non-urgent work

Lever 5: skip the tool fees you do not need

Lever 6: trim long-running agents

Stacking them: one worked example

FAQ

Sources

Ishan Karunaratne

Related posts

Cut Your LLM API Bill in Half with Batch Processing (2026)

How to Uninstall Node.js (Every Install Method)

How to Store Dates and Times in MySQL (with Time Zones)

Which lever should I pull first?

Do batching and caching discounts stack?

Does cutting tokens hurt output quality?

Is a cheaper provider a real cost lever?

Sources

Ishan Karunaratne