Prompt caching lets your LLM provider remember the prefix of a prompt across calls. The next time you send the same system prompt, the same tool definitions, or the same RAG context, the provider charges you a fraction of the normal input cost for the cached part. On Anthropic, cache hits cost 10% of base input (Sonnet, Haiku, Opus) and cache writes cost 1.25× input, so any prefix used at least twice within the cache TTL is a net win. On OpenAI, caching is automatic for prompts of 1,024 tokens or more and saves about 50% on the cached portion. On Gemini, context caching is explicit and saves around 75%. I'll walk through what to cache, how to structure your prompts to maximise hits, the TTL trade-offs, and the cost math, with runnable code in JavaScript and Python.
The reason this matters in 2026: production AI workflows have grown long. A retrieval-augmented chatbot might inject 50KB of context per call. A code-review assistant sends the entire system prompt plus a 200-line diff. A summarisation pipeline runs the same 8KB instructions across thousands of documents. Without caching, you pay full price for the static prefix every call. With caching, you pay full price once per TTL window and 10% thereafter.
Jump to:
- The cost math: when caching pays off
- What to cache (and what not to)
- Anthropic: explicit cache_control
- OpenAI: automatic caching above 1024 tokens
- Gemini: explicit context caching
- Code: JavaScript and Python with Anthropic SDK
- The 5-minute vs 1-hour TTL trade-off
- Common pitfalls
- FAQ
The cost math: when caching pays off
Anthropic Claude Sonnet 4.6 pricing (as of mid-2026): $3 per million input tokens, $15 per million output tokens. Prompt caching adds two rates: $3.75 per million tokens written to cache (1.25× input) and $0.30 per million tokens read from cache (10% of input). Each hit refreshes the 5-minute cache at no extra charge; opting into the 1-hour cache raises the write price to 2× input ($6 per million tokens).
Concrete example. A RAG chatbot with:
- 10,000-token system prompt + retrieved context (mostly stable across calls)
- 200-token user message (always new)
- 500-token response
Without caching, per call: 10,200 × $3/M = $0.0306 input + 500 × $15/M = $0.0075 output = $0.0381 per call.
With caching (one cache write, then hits):
- Cache write call: 10,000 × $3.75/M + 200 × $3/M + 500 × $15/M = $0.0375 + $0.0006 + $0.0075 = $0.0456 (slightly more than no-cache for the first call).
- Cache hit calls: 10,000 × $0.30/M + 200 × $3/M + 500 × $15/M = $0.003 + $0.0006 + $0.0075 = $0.0111 per call.
Break-even arrives at the second call within the cache TTL: two uncached calls cost $0.0762, while one cache write plus one hit costs $0.0567. From then on, every hit saves $0.027 per call (71% cheaper). At 1,000 calls per hour that is roughly $27 saved per hour, $648 per day, and about $237,000 per year on a single endpoint.
Caching does not pay off for endpoints with under 1,024 tokens of stable prefix or for one-shot workloads where the prompt never repeats.
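To check the arithmetic above or plug in your own token counts, here is a minimal sketch of the cost model. The prices are the ones quoted in this section; the `Pricing` class and the function names are illustrative, not part of any SDK, and the model assumes all calls land inside one TTL window (one write, then hits).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens, using the figures quoted in this section."""
    input: float = 3.00
    output: float = 15.00
    cache_write: float = 3.75  # 1.25x input (5-minute TTL)
    cache_read: float = 0.30   # 0.10x input


def call_cost(prefix_tokens, dynamic_tokens, output_tokens, mode, p=Pricing()):
    """Cost of one call. mode is 'none' (no caching), 'write', or 'hit'."""
    prefix_rate = {"none": p.input, "write": p.cache_write, "hit": p.cache_read}[mode]
    total = (prefix_tokens * prefix_rate
             + dynamic_tokens * p.input
             + output_tokens * p.output)
    return total / 1_000_000


def total_cost(n_calls, cached, **tokens):
    """Cost of n_calls (>= 1) inside one TTL window."""
    if not cached:
        return n_calls * call_cost(mode="none", **tokens)
    # One cache write on the first call, cache hits on the rest.
    return call_cost(mode="write", **tokens) + (n_calls - 1) * call_cost(mode="hit", **tokens)


tokens = dict(prefix_tokens=10_000, dynamic_tokens=200, output_tokens=500)
for n in (1, 2, 10):
    print(n, round(total_cost(n, False, **tokens), 4), round(total_cost(n, True, **tokens), 4))
```

With a single call the cached column is the more expensive one; from the second call onward it is cheaper, which is the break-even described above.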
What to cache (and what not to)
Cache anything stable across requests:
- System prompts. The "you are X, follow these rules" preamble.
- Tool definitions. JSON schemas for the tools the model can call.
- RAG context — when you are retrieving the same documents across a conversation, or when many users see the same knowledge base.
- Few-shot examples in the prompt.
- Conversation history in long sessions (Anthropic supports incremental caching where each new turn extends the cache).
Do not cache:
- User messages. They vary per call; caching them adds cost without saving anything.
- Per-call dynamic data like the current timestamp, request ID, or anything that changes every time.
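- Outputs. Prompt caching applies to input tokens only; a model response becomes cacheable only once you send it back as part of the conversation history in a later request.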
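The ordering rule: put the stable prefix first, the dynamic content last. The cache works on contiguous prefixes, so a system prompt that ends with Current time: ${now} never matches the copy cached by the previous call. Move that timestamp out of the system prompt and into the user message, after the last cache breakpoint. Do not move it into the tool definitions: on Anthropic they sit ahead of the system prompt in the cached prefix, so a change there invalidates everything after it.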
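As a rough illustration of the ordering rule, the sketch below builds a request payload in which the system prompt stays identical across calls and the timestamp travels with the user message. `build_request` and `STABLE_SYSTEM_PROMPT` are hypothetical names for this example, and the payload is a plain dictionary shaped like the Anthropic request shown in the next section; nothing is sent over the network.

```python
from datetime import datetime, timezone

# Placeholder text; in practice this is your long, stable instruction block.
STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the rules below..."


def build_request(user_message, now=None):
    """Keep per-call data (here, the current time) out of the cached prefix."""
    now = now or datetime.now(timezone.utc)
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        # Stable part first: this block is the same on every call.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic part last: the timestamp and the user's text come after the breakpoint.
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Current time: {now.isoformat()}\n\n{user_message}",
                    }
                ],
            }
        ],
    }


first = build_request("Where is my order?", datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc))
second = build_request("Cancel my order.", datetime(2026, 1, 1, 9, 5, tzinfo=timezone.utc))
print(first["system"] == second["system"])
```

The two requests differ only in their messages, so the cached system block can be reused by the second call.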
Anthropic: explicit cache_control
Anthropic uses an explicit cache_control marker on the content blocks you want cached. The marker tells the API "everything up to this point should be cached":
    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 1024,
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "{{long_stable_context}}",
              "cache_control": { "type": "ephemeral" }
            },
            {
              "type": "text",
              "text": "{{dynamic_user_message}}"
            }
          ]
        }
      ]
    }

You can place up to 4 cache breakpoints in a single request, which lets you cache the system prompt, the tool definitions, and the RAG context independently. Different cache hits cost the same; the breakpoints just give you more flexibility in mixing stable and dynamic content.
The default TTL is 5 minutes, refreshed on each hit. For RAG endpoints where the context is reused over hours, set cache_control: { type: "ephemeral", ttl: "1h" }. The write then costs 2× input instead of 1.25×, so compared with no caching at all, two calls in the hour roughly break even (2.1× versus 2.0×) and the third puts you ahead (2.2× versus 3.0×), whatever the prefix size.
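Because the limit is four breakpoints per request, it can help to count markers before sending. The following is a minimal sketch, not an SDK feature: `count_breakpoints` and `check_breakpoints` are hypothetical helpers that walk a request dictionary, and the `search_docs` tool is made up for the example.

```python
MAX_BREAKPOINTS = 4  # the per-request limit described above


def count_breakpoints(node):
    """Recursively count dictionaries that carry a 'cache_control' key."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        return own + sum(
            count_breakpoints(value)
            for key, value in node.items()
            if key != "cache_control"
        )
    if isinstance(node, list):
        return sum(count_breakpoints(item) for item in node)
    return 0


def check_breakpoints(request):
    """Return the number of breakpoints, or raise if the request exceeds the limit."""
    n = count_breakpoints(request)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints; the limit is {MAX_BREAKPOINTS}")
    return n


marker = {"type": "ephemeral"}
request = {
    "tools": [
        {
            "name": "search_docs",  # hypothetical tool
            "description": "Search the knowledge base.",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": marker,
        }
    ],
    "system": [{"type": "text", "text": "stable instructions", "cache_control": marker}],
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "retrieved context", "cache_control": marker},
                {"type": "text", "text": "the user's question"},
            ],
        }
    ],
}
print(check_breakpoints(request))
```

Here the tool list, the system prompt, and the RAG context each carry one marker, leaving one breakpoint spare.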
OpenAI: automatic caching above 1024 tokens
OpenAI's prompt caching is automatic — no marker needed. Any prompt of 1,024 tokens or longer that matches a recently-seen prefix gets a discount of roughly 50% on the cached portion. Cache writes are free (no extra cost over the base input price), and there is no TTL control — the cache lives for some provider-determined window (typically minutes, sometimes longer for very common prefixes).
The implication for prompt structure is the same as Anthropic: put stable content first, dynamic content last, keep the prefix above 1,024 tokens. The implication for cost: less savings per cached token than Anthropic, but no per-call ceremony.
Gemini: explicit context caching
Gemini Context Caching is explicit and uses a separate API: you upload the stable context to the cache, get back a cache ID, then reference that ID in subsequent requests. Cached tokens cost roughly 25% of base input.
    from google import genai

    client = genai.Client()

    # Upload the stable context once and keep it for an hour.
    cached = client.caches.create(
        model="gemini-2.5-pro",
        config={"contents": ["{{long_stable_context}}"], "ttl": "3600s"},
    )

    # Later requests reference the cache by name and send only the dynamic part.
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=["{{dynamic_user_message}}"],
        config={"cached_content": cached.name},
    )

The trade-off vs Anthropic: Gemini's separately managed cache is better for long-lived, infrequently-changing context (a whole legal document, a manual). Anthropic's inline cache_control markers, with a TTL that refreshes on every hit, are better for high-frequency reads where the TTL lines up with your traffic pattern.
Code: JavaScript and Python with Anthropic SDK
JavaScript:
    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic();

    const SYSTEM_PROMPT = `You are a helpful technical writer...
    (thousands of tokens of stable instructions and examples)
    `;
    // fetchKnowledgeBase() stands in for your own loader.
    // Top-level await requires an ES module.
    const RAG_CONTEXT = await fetchKnowledgeBase();

    async function ask(userMessage) {
      return client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        system: [
          {
            type: "text",
            text: SYSTEM_PROMPT,
            cache_control: { type: "ephemeral" }, // breakpoint 1
          },
        ],
        messages: [
          {
            role: "user",
            content: [
              {
                type: "text",
                text: RAG_CONTEXT,
                cache_control: { type: "ephemeral" }, // breakpoint 2
              },
              // Dynamic content goes last, after every breakpoint.
              { type: "text", text: userMessage },
            ],
          },
        ],
      });
    }

Two cache breakpoints: one on the system prompt, one on the RAG context. Both refresh on hit. Total cached content can be tens of thousands of tokens.
Python:
    from anthropic import Anthropic

    client = Anthropic()

    SYSTEM_PROMPT = """You are a helpful technical writer..."""
    # fetch_knowledge_base() stands in for your own loader.
    RAG_CONTEXT = fetch_knowledge_base()


    def ask(user_message: str):
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=[
                {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
            ],
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": RAG_CONTEXT, "cache_control": {"type": "ephemeral"}},
                        # Dynamic content goes last, after every breakpoint.
                        {"type": "text", "text": user_message},
                    ],
                }
            ],
        )

Check the usage field on the response: cache_creation_input_tokens tells you what was written, cache_read_input_tokens tells you what was read from cache. Watch these in your monitoring to confirm caching is actually firing.
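For monitoring, those counters can be folded into a single hit ratio. This is a minimal sketch: `cache_stats` is a hypothetical helper that takes a plain dictionary with the usage field names mentioned above (with the Python SDK, something like `response.usage.model_dump()` should produce one), and it assumes `input_tokens` counts only the tokens that were neither written to nor read from the cache.

```python
def cache_stats(usage):
    """Summarise cache activity for one response from its usage counters."""
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = written + read + uncached
    return {
        "written": written,
        "read": read,
        "uncached": uncached,
        # Share of all input tokens that were served from cache.
        "hit_ratio": read / total if total else 0.0,
    }


# First call in a TTL window: the prefix is written, nothing is read.
print(cache_stats({"input_tokens": 200, "cache_creation_input_tokens": 10_000,
                   "cache_read_input_tokens": 0}))
# Later call: the prefix is read from cache.
print(cache_stats({"input_tokens": 200, "cache_creation_input_tokens": 0,
                   "cache_read_input_tokens": 10_000}))
```

A hit ratio that stays at zero across consecutive calls is the signal that the prefix is changing between requests.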
The 5-minute vs 1-hour TTL trade-off
Anthropic offers two TTLs: 5-minute (default) and 1-hour.
- 5-minute TTL: write cost is 1.25× input, the standard cache-write price with no surcharge. Hit cost is 10% of input. Best for high-frequency endpoints where the same prefix is hit every few seconds: chat sessions, batch processing pipelines, tightly-scoped agents.
- 1-hour TTL: write cost is 2× input. Hit cost is 10% of input. Best for medium-frequency endpoints where the prefix is hit a few times per hour — RAG over a slowly-changing knowledge base, daily-batch workflows, low-traffic chatbots.
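The dividing line between the two is the 5-minute window itself. If consecutive calls on a prefix arrive less than 5 minutes apart, the 5-minute cache stays warm through free refreshes and its lower write cost wins. If the gaps regularly exceed 5 minutes, every call on the 5-minute cache becomes a fresh 1.25× write, and the 1-hour cache is cheaper as soon as two calls land within the hour (2 + 0.1 = 2.1× versus 2 × 1.25 = 2.5×).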
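To compare the two TTLs for a given traffic pattern, here is a deliberately simplified model using the multipliers listed above. It assumes evenly spaced calls and that every hit refreshes the TTL for both cache types; `prefix_cost_multiplier` and `compare` are illustrative names, and costs are expressed in units of one uncached read of the prefix.

```python
def prefix_cost_multiplier(calls, gap_minutes, ttl_minutes, write_multiplier,
                           read_multiplier=0.10):
    """Total prefix cost for evenly spaced calls, in units of one uncached read.

    A call is a hit when the gap since the previous call is shorter than the
    TTL (each hit is assumed to refresh the TTL); otherwise it is a new write.
    """
    if calls <= 0:
        return 0.0
    if gap_minutes < ttl_minutes:
        # One write, then the cache stays warm for every later call.
        return write_multiplier + (calls - 1) * read_multiplier
    # The cache has always expired by the next call, so every call writes.
    return calls * write_multiplier


def compare(calls, gap_minutes):
    """Cost of the same traffic with no caching, the 5-minute TTL, and the 1-hour TTL."""
    return {
        "uncached": float(calls),
        "ttl_5m": prefix_cost_multiplier(calls, gap_minutes, 5, 1.25),
        "ttl_1h": prefix_cost_multiplier(calls, gap_minutes, 60, 2.0),
    }


print(compare(calls=6, gap_minutes=1))   # six calls, one minute apart
print(compare(calls=6, gap_minutes=10))  # six calls, ten minutes apart
```

Real traffic is bursty rather than evenly spaced, so treat the output as a first estimate and confirm against the usage counters in production.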
Common pitfalls
- Cache misses you cannot explain. Almost always a non-deterministic detail at the start of the prompt: timestamps, randomised IDs, user-specific personalisation. Check the prefix byte-for-byte across two consecutive calls.
- Caching a prefix that contains the user message. If the user message changes per call, no two calls share that prefix and the cache writes accumulate without ever being read. Move dynamic content out of the cached region.
- Forgetting to cache tool definitions. If your prompt has a long list of tools in the system prompt or tool-definitions section, that is one of the biggest savings opportunities — but only if the tool list is stable across calls. Define tools once at the top, before any dynamic content.
- Cache writes counted as savings. A cache write is more expensive than the no-cache baseline. If your traffic pattern is "one call per user with a new prefix every time", caching adds cost rather than removing it. Only enable caching when you have at least 2 calls per TTL window using the same prefix.
- Mixing models with the same cached content. Cache is per-model. Switching from claude-sonnet-4-6 to claude-opus-4-7 invalidates everything; the new model rewrites the cache from scratch.
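For the first pitfall, comparing the prefix across two consecutive calls is easy to automate. The sketch below is one hypothetical way to do it: `first_divergence` and `show_divergence` are illustrative helpers that operate on the prompt text you log for each call, not on anything returned by the provider.

```python
def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No difference in the shared part: identical, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def show_divergence(a: str, b: str, context: int = 20) -> str:
    """Describe where two prompts stop matching, with a little surrounding text."""
    i = first_divergence(a, b)
    if i == -1:
        return "prompts are identical"
    start = max(0, i - context)
    return (f"diverge at character {i}: "
            f"{a[start:i + context]!r} vs {b[start:i + context]!r}")


call_1 = "You are a helpful assistant. Current time: 09:00. Follow these rules..."
call_2 = "You are a helpful assistant. Current time: 09:05. Follow these rules..."
print(show_divergence(call_1, call_2))
```

If the reported position falls before your cache breakpoint, whatever sits at that position is what is defeating the cache.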
What to do next
For the broader question of which Claude tier to send each prompt to:
- How to choose between Claude Haiku, Sonnet, and Opus covers the cost-quality trade-off across the three tiers and where each one wins.
For the prompt-engineering side that maximises cache hits and reduces token usage:
- How to write an effective system prompt covers the 5 parts of a production system prompt with cacheable structure.
For the broader AI tools and skills landscape:
- Top 5 MCP Servers Every Developer Should Try in 2026 covers the integration layer that often drives the most prompt-cache savings (RAG MCPs, document MCPs).
External references: the Anthropic prompt caching documentation is the canonical reference. OpenAI's automatic prompt caching announcement explains the 1,024-token threshold. Gemini context caching docs cover the explicit-cache pattern.





