How to Cut LLM API Costs with Prompt Caching (2026)

Prompt caching lets your LLM provider remember the prefix of a prompt across calls. The next time you send the same system prompt, the same tool definitions, or the same RAG context, the provider charges you a fraction of the normal input cost for the cached part. On Anthropic, cache hits cost 10% of base input (Sonnet, Haiku, Opus) and cache writes cost 1.25× input, so any prefix used at least twice within the cache TTL is a net win. On OpenAI, caching is automatic for prompts of 1,024 tokens or more and the cached-input discount is model-specific: current models bill cached tokens at roughly 10% of base input, so up to 90% off. On Gemini, context caching is explicit and on the 2.5 model family the cached-token discount is roughly 90%. I'll walk through what to cache, how to structure your prompts to maximise hits, the TTL trade-offs, and the cost math, with runnable code in JavaScript and Python.

The reason this matters in 2026: production AI workflows have grown long. A retrieval-augmented chatbot might inject 50KB of context per call. A code-review assistant sends the entire system prompt plus a 200-line diff. A summarisation pipeline runs the same 8KB instructions across thousands of documents. Without caching, you pay full price for the static prefix every call. With caching, you pay the cache-write price once per TTL window and about 10% of the input price on every hit after that.

Jump to:

The cost math: when caching pays off
What to cache (and what not to)
Anthropic: explicit cache_control
OpenAI: automatic caching above 1024 tokens
Gemini: explicit context caching
Code: JavaScript and Python with Anthropic SDK
The 5-minute vs 1-hour TTL trade-off
Common pitfalls
FAQ

The cost math: when caching pays off

Anthropic Claude Sonnet 4.6 pricing (as of mid-2026): $3 per million input tokens, $15 per million output tokens. Prompt caching adds: $3.75 per million tokens written to cache (1.25× input), and $0.30 per million tokens read from cache (10% of input). Each hit refreshes the 5-minute cache at no extra charge; writes to the 1-hour cache cost 2× the base input price ($6 per million tokens).

Concrete example. A RAG chatbot with:

10,000-token system prompt + retrieved context (mostly stable across calls)
200-token user message (always new)
500-token response

Without caching, per call: 10,200 × $3/M = $0.0306 input + 500 × $15/M = $0.0075 output = $0.0381 per call.

With caching (one cache write, then hits):

Cache write call: 10,000 × $3.75/M + 200 × $3/M + 500 × $15/M = $0.0375 + $0.0006 + $0.0075 = $0.0456 (slightly more than no-cache for the first call).
Cache hit calls: 10,000 × $0.30/M + 200 × $3/M + 500 × $15/M = $0.003 + $0.0006 + $0.0075 = $0.0111 per call.

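Break-even is at 2 calls within the cache TTL: two uncached calls cost $0.0762, while one cache write plus one hit costs $0.0567. From the second call onward you save $0.027 per call (71% cheaper). At 1,000 calls per hour the cache stays warm under either TTL: about $27 saved per hour, $648 per day, ~$237,000 per year on a single endpoint.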
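The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the Sonnet prices quoted in this section; the names `Pricing`, `call_cost`, and `break_even_calls` are illustrative and not part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens; defaults are the Sonnet figures quoted above."""
    base_input: float = 3.00
    output: float = 15.00
    cache_write: float = 3.75  # 1.25x base input (5-minute TTL)
    cache_read: float = 0.30   # 10% of base input


def call_cost(mode, prefix_tokens, dynamic_tokens, output_tokens, p=Pricing()):
    """Cost in USD of one call; mode is 'none', 'write', or 'hit'."""
    prefix_rate = {"none": p.base_input, "write": p.cache_write, "hit": p.cache_read}[mode]
    micro_usd = (prefix_tokens * prefix_rate
                 + dynamic_tokens * p.base_input
                 + output_tokens * p.output)
    return micro_usd / 1_000_000


def break_even_calls(prefix_tokens, dynamic_tokens, output_tokens, limit=1000):
    """Smallest number of calls in one TTL window at which caching is cheaper."""
    shape = (prefix_tokens, dynamic_tokens, output_tokens)
    for n in range(1, limit + 1):
        uncached = n * call_cost("none", *shape)
        cached = call_cost("write", *shape) + (n - 1) * call_cost("hit", *shape)
        if cached < uncached:
            return n
    return None  # caching never wins within `limit` calls


example = (10_000, 200, 500)  # prefix, user message, response tokens
for mode in ("none", "write", "hit"):
    print(f"{mode:>5}: ${call_cost(mode, *example):.4f}")
print("break-even calls:", break_even_calls(*example))
```

Swapping in your own token counts and prices is a quick way to check whether an endpoint is worth caching before you touch the request code.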

Caching does not pay off for endpoints with under 1,024 tokens of stable prefix or for one-shot workloads where the prompt never repeats.

What to cache (and what not to)

Cache anything stable across requests:

System prompts. The "you are X, follow these rules" preamble.
Tool definitions. JSON schemas for the tools the model can call.
RAG context. Worth caching when you retrieve the same documents across a conversation, or when many users see the same knowledge base.
Few-shot examples in the prompt.
Conversation history in long sessions (Anthropic supports incremental caching where each new turn extends the cache).

Do not cache:

User messages. They vary per call; caching them adds cost without saving anything.
Per-call dynamic data like the current timestamp, request ID, or anything that changes every time.
Outputs. Prompt caching applies to input tokens only; generated output is billed at the full output rate on every call.

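The ordering rule: put the stable prefix first, the dynamic content last. The cache works on contiguous prefixes, so a system prompt that ends with Current time: ${now} changes on every call and never produces a hit. Move that timestamp out of the system prompt and into the user message, after the last cached block. Do not move it into the tool definitions: on Anthropic they sit at the very start of the prefix, so a change there invalidates everything that follows.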
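The ordering rule can be sketched as a request builder. This is a minimal illustration, not a prescribed pattern: `build_request` and `STABLE_SYSTEM_PROMPT` are hypothetical names, and the returned dictionary is meant to be passed as keyword arguments to the Anthropic SDK's `client.messages.create`.

```python
from datetime import datetime, timezone

# Placeholder text; a real prefix must exceed the provider's minimum cacheable length.
STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the rules below..."


def build_request(user_message: str, now: datetime) -> dict:
    """Build Messages API arguments with stable content first, dynamic content last."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,  # byte-identical on every call
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # The timestamp travels with the per-call message, after the cached block.
                "content": f"Current time: {now.isoformat()}\n\n{user_message}",
            }
        ],
    }


first = build_request("Where is my order?", datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc))
second = build_request("Can I get a refund?", datetime(2026, 1, 1, 9, 5, tzinfo=timezone.utc))
print(first["system"] == second["system"])      # True: the cached prefix is unchanged
print(first["messages"] == second["messages"])  # False: only the tail differs
```

Because the timestamp lives in the final user message, two calls five minutes apart still share an identical cached region.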

Anthropic: explicit cache_control

Anthropic uses an explicit cache_control marker on the content blocks you want cached. The marker tells the API "everything up to this point should be cached":

json

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "{{long_stable_context}}",
          "cache_control": { "type": "ephemeral" }
        },
        {
          "type": "text",
          "text": "{{dynamic_user_message}}"
        }
      ]
    }
  ]
}

You can place up to 4 cache breakpoints in a single request, which lets you cache the system prompt, the tool definitions, and the RAG context independently. Different cache hits cost the same; the breakpoints just give you more flexibility in mixing stable and dynamic content.
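As a sketch of how three of those breakpoints might be laid out, the helper below marks the last tool definition, the system prompt, and the RAG context. `build_cached_request`, `count_breakpoints`, and the `get_order` tool are hypothetical names for illustration; the sketch assumes that a marker on the last entry of the tools array covers the whole tool list.

```python
EPHEMERAL = {"type": "ephemeral"}


def build_cached_request(tools, system_prompt, rag_context, user_message):
    """Return Messages API arguments with three cache breakpoints."""
    # Copy so the caller's tool list is not mutated, then mark the last tool:
    # a marker there is assumed to cover every tool definition before it.
    cached_tools = [dict(tool) for tool in tools]
    if cached_tools:
        cached_tools[-1]["cache_control"] = EPHEMERAL
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": cached_tools,
        "system": [{"type": "text", "text": system_prompt, "cache_control": EPHEMERAL}],
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": rag_context, "cache_control": EPHEMERAL},
                {"type": "text", "text": user_message},  # dynamic tail, not cached
            ],
        }],
    }


def count_breakpoints(request) -> int:
    """Count blocks carrying cache_control, to stay within the 4-breakpoint limit."""
    blocks = list(request.get("tools", [])) + list(request.get("system", []))
    for message in request["messages"]:
        if isinstance(message["content"], list):
            blocks.extend(message["content"])
    return sum(1 for block in blocks if "cache_control" in block)


tools = [{
    "name": "get_order",  # hypothetical tool
    "description": "Look up an order by ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]
request = build_cached_request(tools, "stable rules...", "retrieved docs...", "Where is order 42?")
print(count_breakpoints(request))  # 3
```

Counting markers before sending is a cheap guard if breakpoints are added in several places in your codebase.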

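The default TTL is 5 minutes, refreshed on each hit. For RAG endpoints where the context is reused over hours, set cache_control: { type: "ephemeral", ttl: "1h" }. The write then costs 2× input instead of 1.25×, so the 1-hour cache needs about three requests per hour on the same prefix to beat sending it uncached (2× + 0.1× + 0.1× = 2.2×, against 3× without caching); with only two requests it costs 2.1× against 2×.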

OpenAI: automatic caching above 1024 tokens

OpenAI's prompt caching is automatic, no marker needed. Any prompt of 1,024 tokens or longer that matches a recently-seen prefix gets a discount on the cached portion. The discount is model-specific, and current models bill cached tokens at roughly 10% of base input, i.e. up to 90% off. Cache writes are free (no extra cost over the base input price), and there is no TTL control: the cache lives for some provider-determined window (typically minutes, sometimes longer for very common prefixes).

The implication for prompt structure is the same as Anthropic: put stable content first, dynamic content last, keep the prefix above 1,024 tokens. The implication for cost: comparable savings per cached token to Anthropic on current model families, with no per-call ceremony.
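Since there is no marker to inspect, the way to confirm that automatic caching is firing is to read the usage data on the response. The sketch below assumes the Chat Completions usage shape, where the cached count is reported under `prompt_tokens_details.cached_tokens`; `cached_fraction` is a hypothetical helper, and the sample object stands in for `response.usage` so the example runs offline.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens served from cache, from a Chat Completions usage object."""
    prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
    details = getattr(usage, "prompt_tokens_details", None)
    cached_tokens = getattr(details, "cached_tokens", 0) or 0
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Stand-in for response.usage so the sketch runs without a network call.
sample_usage = SimpleNamespace(
    prompt_tokens=2048,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1920),
)
print(f"{cached_fraction(sample_usage):.1%} of the prompt was read from cache")
```

A fraction that stays at zero across repeated calls usually means the prefix is under the threshold or is changing between requests.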

Gemini: explicit context caching

Gemini Context Caching is explicit and uses a separate API: you upload the stable context to the cache, get back a cache ID, then reference that ID in subsequent requests. On the Gemini 2.5 model family, cached tokens cost roughly 10% of base input (about a 90% discount), plus a separate storage fee billed per MTok-hour for as long as the cache is kept alive.

python

from google import genai

client = genai.Client()
cached = client.caches.create(
    model="gemini-2.5-pro",
    config={"contents": ["{{long_stable_context}}"], "ttl": "3600s"},
)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["{{dynamic_user_message}}"],
    config={"cached_content": cached.name},
)

The trade-off vs Anthropic: Gemini's explicit caching is better for long-lived, infrequently-changing context (a whole legal document, a manual). Anthropic's marker-based caching with a short, auto-refreshing TTL is better for high-frequency reads where the TTL lines up with your traffic pattern.

Code: JavaScript and Python with Anthropic SDK

JavaScript:

javascript

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const SYSTEM_PROMPT = `You are a helpful technical writer...
(thousands of tokens of stable instructions and examples)
`;

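// fetchKnowledgeBase() stands in for your own loader; it must return the
// same string on every call, or the cached prefix will not match.
const RAG_CONTEXT = await fetchKnowledgeBase();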

async function ask(userMessage) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: RAG_CONTEXT,
            cache_control: { type: "ephemeral" },
          },
          { type: "text", text: userMessage },
        ],
      },
    ],
  });
}

Two cache breakpoints: one on the system prompt, one on the RAG context. Both refresh on hit. Total cached content can be tens of thousands of tokens.

Python:

python

from anthropic import Anthropic
client = Anthropic()

SYSTEM_PROMPT = """You are a helpful technical writer..."""
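# fetch_knowledge_base() stands in for your own loader; it must return the
# same string on every call, or the cached prefix will not match.
RAG_CONTEXT = fetch_knowledge_base()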

def ask(user_message: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": RAG_CONTEXT, "cache_control": {"type": "ephemeral"}},
                    {"type": "text", "text": user_message},
                ],
            }
        ],
    )

Check the usage field on the response: cache_creation_input_tokens tells you what was written, cache_read_input_tokens tells you what was read from cache. Watch these in your monitoring to confirm caching is actually firing.
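A small helper makes those fields easy to log. This is a sketch under one assumption worth verifying against the API reference: that `input_tokens` counts only the tokens that were neither written to nor read from cache, so the three fields sum to the full input. `cache_stats` is a hypothetical name, and the sample object stands in for `response.usage`.

```python
from types import SimpleNamespace


def cache_stats(usage) -> dict:
    """Summarise cache activity from a Messages API usage object."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    # Assumption: input_tokens excludes tokens written to or read from cache.
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached
    return {
        "written": written,
        "read": read,
        "uncached": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


# Stand-in for response.usage on a cache-hit call from the RAG example.
sample_usage = SimpleNamespace(
    input_tokens=200,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=10_000,
)
stats = cache_stats(sample_usage)
print(f"read={stats['read']} written={stats['written']} hit_ratio={stats['hit_ratio']:.2f}")
```

Emitting `hit_ratio` as a metric per endpoint gives an early signal when a prompt change silently breaks the cached prefix.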

The 5-minute vs 1-hour TTL trade-off

Anthropic offers two TTLs: 5-minute (default) and 1-hour.

5-minute TTL: write cost is 1.25× input, and each hit refreshes the TTL at no extra charge. Hit cost is 10% of input. Best for high-frequency endpoints where the same prefix is hit every few seconds: chat sessions, batch processing pipelines, tightly-scoped agents.
1-hour TTL: write cost is 2× input. Hit cost is 10% of input. Best for medium-frequency endpoints where the prefix is hit a few times per hour — RAG over a slowly-changing knowledge base, daily-batch workflows, low-traffic chatbots.

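With evenly spaced calls, the crossover between the two TTLs sits at the 5-minute mark. If the gap between calls is shorter than 5 minutes, the default cache never expires and its cheaper write wins: you pay 1.25× once instead of 2× once, and hits cost the same either way. If the gap is longer than 5 minutes but shorter than an hour, the 5-minute cache has expired before every call, so each call pays a fresh 1.25× write while the 1-hour cache keeps hitting at 0.1×. At one call every 12 minutes, for example, that is 1.25× versus 0.1× per call after the initial write. If the gap is longer than an hour, neither TTL helps and the prefix is cheaper sent uncached.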
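One way to sanity-check a TTL choice against your own traffic is a small steady-state model. The sketch below makes a deliberately idealised assumption: calls arrive perfectly evenly spaced and every hit refreshes the TTL. Real traffic is bursty, so treat the output as a rough guide rather than a prediction. `prefix_cost` is a hypothetical helper, and costs are expressed as multiples of one uncached prefix.

```python
READ = 0.10  # cache hit, as a multiple of the base input price


def prefix_cost(n_calls, gap_minutes, ttl_minutes, write_multiplier):
    """Total prefix cost for n evenly spaced calls, in multiples of one uncached prefix."""
    if n_calls == 0:
        return 0.0
    # Every hit refreshes the TTL, so a gap shorter than the TTL always hits
    # and a longer gap always finds the cache expired and pays a new write.
    later_call = READ if gap_minutes < ttl_minutes else write_multiplier
    return write_multiplier + (n_calls - 1) * later_call


for gap in (1, 12, 90):
    five_min = prefix_cost(10, gap, ttl_minutes=5, write_multiplier=1.25)
    one_hour = prefix_cost(10, gap, ttl_minutes=60, write_multiplier=2.0)
    print(f"gap {gap:>2} min: 5m={five_min:.2f}x  1h={one_hour:.2f}x  uncached=10.00x")
```

Under this model the 5-minute cache wins at a 1-minute gap, the 1-hour cache wins at a 12-minute gap, and at a 90-minute gap both cost more than not caching at all.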

Common pitfalls

Cache misses you cannot explain. Almost always a non-deterministic detail at the start of the prompt: timestamps, randomised IDs, user-specific personalisation. Check the prefix byte-for-byte across two consecutive calls.
Caching a prefix that contains the user message. If the user message changes per call, no two calls share that prefix and the cache writes accumulate without ever being read. Move dynamic content out of the cached region.
Forgetting to cache tool definitions. If your prompt has a long list of tools in the system prompt or tool-definitions section, that is one of the biggest savings opportunities — but only if the tool list is stable across calls. Define tools once at the top, before any dynamic content.
Cache writes counted as savings. A cache write is more expensive than the no-cache baseline. If your traffic pattern is "one call per user with a new prefix every time", caching adds cost rather than removing it. Only enable caching when you have at least 2 calls per TTL window using the same prefix.
Mixing models with the same cached content. Cache is per-model. Switching from claude-sonnet-4-6 to claude-opus-4-7 invalidates everything; the new model rewrites the cache from scratch.
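For the first pitfall, the byte-for-byte check is easy to automate. A minimal sketch, assuming you can log the fully rendered prompt prefix for two consecutive calls; `first_divergence` and `prefix_fingerprint` are hypothetical helper names.

```python
import hashlib


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a strict prefix of the other, or they are equal.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the cached region, cheap enough to log on every call."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


# Two rendered prompts that differ only in a per-request ID at the start.
call_1 = "Request 8f3a\nYou are a helpful assistant."
call_2 = "Request 91bc\nYou are a helpful assistant."

index = first_divergence(call_1, call_2)
print(index)  # 8
print(repr(call_1[max(0, index - 8):index + 8]))  # context around the mismatch
print(prefix_fingerprint(call_1) == prefix_fingerprint(call_2))  # False
```

Logging the fingerprint alongside the cache usage fields makes it straightforward to correlate a drop in hits with the deploy that changed the prefix.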

What to do next

Prompt caching is one cost lever. The full set, and the order to pull them, is in how to cut your AI API bill. The lever that stacks hardest with caching:

The LLM batch API is 50% off, and the discount combines with caching, so a high-volume job with a shared prefix gets both reductions at once.

For the broader question of which Claude tier to send each prompt to:

How to choose between Claude Haiku, Sonnet, and Opus covers the cost-quality trade-off across the three tiers and where each one wins.

For the prompt-engineering side that maximises cache hits and reduces token usage:

How to write an effective system prompt covers the 5 parts of a production system prompt with cacheable structure.
How to get reliable JSON from an LLM covers the structured-output and schema techniques worth baking into the cached prefix, since the JSON-shaping instructions and tool schemas are exactly the stable content you want to cache once and reuse.

For the broader AI tools and skills landscape:

Top 5 MCP Servers Every Developer Should Try in 2026 covers the integration layer that often drives the most prompt-cache savings (RAG MCPs, document MCPs).

External references: the Anthropic prompt caching documentation is the canonical reference. OpenAI's automatic prompt caching announcement explains the 1,024-token threshold. Gemini context caching docs cover the explicit-cache pattern.

FAQ

Does prompt caching reduce latency in addition to cost?

Yes, modestly. On Anthropic, cache hits reduce time-to-first-token by 20-50% on prompts with large stable prefixes, because the model does not have to re-process the cached tokens. The output-token generation speed is unchanged.

The bigger latency wins come from caching tool definitions (which the model uses for routing) and RAG context (which is otherwise re-tokenised and re-processed every call).

Can I cache the same prefix across multiple Claude models?

No. The cache is keyed on the exact model identifier. Switching from claude-sonnet-4-6 to claude-opus-4-7 invalidates the cache; the next call writes a fresh cache for the new model.

If you route a single endpoint across multiple models (e.g., Sonnet for most queries, Opus for hard ones), each model maintains its own cache. The cost math assumes the same model handles enough requests within a TTL window to amortise the write.

Is OpenAI's automatic caching as good as Anthropic's explicit caching?

Different trade-offs, but the cost gap has closed. OpenAI's automatic caching is zero-effort, and on current model families it gives up to 90% savings on the cached portion (cached tokens billed at roughly 10% of base input). Anthropic's explicit caching needs cache markers in your prompt and also gives 90% savings on hits.

The savings per cached token are now comparable. The real difference is the API surface: Anthropic's explicit markers give you precise control over what gets cached and a choice of TTL, while OpenAI's automatic caching wins for one-off or low-volume workloads because there is nothing to manage.

Does prompt caching work with tool use?

Yes, and tool definitions are one of the best things to cache. Long tool schemas (a dozen tools with detailed parameter descriptions) can be tens of thousands of tokens combined; caching them recovers the cost of a verbose tool-use setup.

On Anthropic, place the cache_control marker on the last tool definition in the tools array, which caches the whole tool list. On OpenAI, tool definitions are part of the prompt prefix and automatically cached above the 1,024-token threshold.

Can prompt caching leak data between users?

No, not on any of the major providers. Caches are scoped per organisation and per API key; another customer's cache is never readable from your account. Within your account, all your traffic shares the cache namespace, which is what enables high-volume cache hits.

For multi-tenant applications where different users see different RAG contexts, structure your prompts so the per-user content is outside the cached region. Otherwise each user's context produces a different prefix, calls miss the cache, and the savings collapse.

Ishan Karunaratne
