TechEarl

How to Cut LLM API Costs with Prompt Caching

Cut LLM API costs by 90% with prompt caching on Anthropic, OpenAI, and Gemini. The math, what to cache, the TTL trade-offs, and code in JavaScript and Python.

Ishan Karunaratne · 11 min read
Prompt caching lets your LLM provider remember the prefix of a prompt across calls. The next time you send the same system prompt, the same tool definitions, or the same RAG context, the provider charges you a fraction of the normal input cost for the cached part. On Anthropic, cache hits cost 10% of base input (Sonnet, Haiku, Opus) and cache writes cost 1.25× input, so any prefix used at least twice within the cache TTL is a net win. On OpenAI, caching is automatic for prompts of 1,024 tokens or more and saves about 50% on the cached portion. On Gemini, context caching is explicit and saves around 75%. I'll walk through what to cache, how to structure your prompts to maximise hits, the TTL trade-offs, and the cost math with runnable code in JavaScript and Python.

The reason this matters in 2026: production AI workflows have grown long. A retrieval-augmented chatbot might inject 50KB of context per call. A code-review assistant sends the entire system prompt plus a 200-line diff. A summarisation pipeline runs the same 8KB instructions across thousands of documents. Without caching, you pay full price for the static prefix on every call. With caching, you pay a one-time write premium (1.25× on Anthropic) when the cache is created and about 10% on every hit while it stays warm.

The cost math: when caching pays off

Anthropic Claude Sonnet 4.6 pricing (as of mid-2026): $3 per million input tokens, $15 per million output tokens. Prompt caching adds: $3.75 per million tokens written to cache (1.25× input), and $0.30 per million tokens read from cache (10% of input). Keeping the default 5-minute cache alive costs nothing extra, since each hit refreshes it; the optional 1-hour cache has a higher write price of 2× base input ($6 per million tokens).

Concrete example. A RAG chatbot with:

  • 10,000-token system prompt + retrieved context (mostly stable across calls)
  • 200-token user message (always new)
  • 500-token response

Without caching, per call: 10,200 × $3/M = $0.0306 input + 500 × $15/M = $0.0075 output = $0.0381 per call.

With caching (one cache write, then hits):

  • Cache write call: 10,000 × $3.75/M + 200 × $3/M + 500 × $15/M = $0.0375 + $0.0006 + $0.0075 = $0.0456 (slightly more than no-cache for the first call).
  • Cache hit calls: 10,000 × $0.30/M + 200 × $3/M + 500 × $15/M = $0.003 + $0.0006 + $0.0075 = $0.0111 per call.

Break-even is at 2 calls within the cache TTL: the first call costs $0.0075 more than the no-cache baseline, and every hit after it saves $0.027 (71% cheaper). At 1,000 calls per hour against a warm cache, that is roughly $27 saved per hour, $648 per day, and about $237,000 per year on a single endpoint.
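To rerun this arithmetic with your own token counts, a small calculator is enough. The sketch below hard-codes the Sonnet prices quoted above and assumes every call lands inside a single cache window; the function names are illustrative, not part of any SDK.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
INPUT = 3.00
OUTPUT = 15.00
CACHE_WRITE = 3.75  # 1.25x input (default 5-minute TTL)
CACHE_READ = 0.30   # 10% of input


def call_cost(prefix_tokens, user_tokens, output_tokens, prefix_rate):
    """Cost of one call, billing the stable prefix at prefix_rate."""
    return (
        prefix_tokens * prefix_rate
        + user_tokens * INPUT
        + output_tokens * OUTPUT
    ) / 1_000_000


def total_cost(n_calls, cached, prefix=10_000, user=200, output=500):
    """Total cost of n_calls that all fall inside one cache window."""
    if n_calls == 0:
        return 0.0
    if not cached:
        return n_calls * call_cost(prefix, user, output, INPUT)
    first = call_cost(prefix, user, output, CACHE_WRITE)  # one cache write
    hit = call_cost(prefix, user, output, CACHE_READ)     # every later call
    return first + (n_calls - 1) * hit


if __name__ == "__main__":
    for n in (1, 2, 10, 1000):
        without = total_cost(n, cached=False)
        with_cache = total_cost(n, cached=True)
        print(f"{n:>5} calls: no cache ${without:.4f}, cached ${with_cache:.4f}")
```

Swap in your own prefix, user, and output sizes to see where your endpoint breaks even.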

Caching does not pay off for endpoints with under 1,024 tokens of stable prefix or for one-shot workloads where the prompt never repeats.

What to cache (and what not to)

Cache anything stable across requests:

  • System prompts. The "you are X, follow these rules" preamble.
  • Tool definitions. JSON schemas for the tools the model can call.
  • RAG context — when you are retrieving the same documents across a conversation, or when many users see the same knowledge base.
  • Few-shot examples in the prompt.
  • Conversation history in long sessions (Anthropic supports incremental caching where each new turn extends the cache).

Do not cache:

  • User messages. They vary per call; caching them adds cost without saving anything.
  • Per-call dynamic data like the current timestamp, request ID, or anything that changes every time.
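  • Outputs. Prompt caching applies to input tokens only; generated output is always billed at the full output rate. A past response becomes cacheable only once it is sent back as part of the conversation history.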

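The ordering rule: put the stable prefix first, the dynamic content last. The cache works on contiguous prefixes, so a system prompt that ends with Current time: ${now} changes on every call and never matches what was cached. Move the timestamp out of the system prompt and into the user message, after the cached content. Do not move it into the tool definitions: on Anthropic, tools come before the system prompt in the cached prefix, so a value that changes there invalidates everything after it.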
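The ordering rule can be sketched as a request builder that keeps the system prompt byte-identical across calls and pushes the timestamp into the user turn. This only builds the payload dictionary and makes no API call; build_request and the placeholder prompt text are hypothetical names for illustration.

```python
from datetime import datetime, timezone

# Placeholder for a long, stable system prompt (assumed; use your own text).
STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the rules below..."


def build_request(user_message: str, now: datetime) -> dict:
    """Return a Messages-API-shaped payload with stable content first."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        # Stable prefix: identical on every call, so it can be cached.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic tail: the timestamp travels with the user message instead.
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Current time: {now.isoformat()}\n\n{user_message}",
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    request = build_request("Where is my order?", datetime.now(timezone.utc))
    print(request["messages"][0]["content"][0]["text"])
```

Two requests built a minute apart differ only in the messages section, which is the part you never expected to hit the cache.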

Anthropic: explicit cache_control

Anthropic uses an explicit cache_control marker on the content blocks you want cached. The marker tells the API "everything up to this point should be cached":

json
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "{{long_stable_context}}",
          "cache_control": { "type": "ephemeral" }
        },
        {
          "type": "text",
          "text": "{{dynamic_user_message}}"
        }
      ]
    }
  ]
}

You can place up to 4 cache breakpoints in a single request, which lets you cache the system prompt, the tool definitions, and the RAG context independently. Different cache hits cost the same; the breakpoints just give you more flexibility in mixing stable and dynamic content.
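As a rough illustration of multiple breakpoints, the sketch below marks the last tool definition, the system prompt, and the RAG context, leaving the user message uncached. It only assembles the payload; build_payload and count_breakpoints are hypothetical helpers, and the tool shown in the usage example is made up.

```python
import copy

CACHE = {"type": "ephemeral"}


def build_payload(tools, system_prompt, rag_context, user_message):
    """Build a Messages-API-shaped dict with three cache breakpoints."""
    tools = copy.deepcopy(tools)  # avoid mutating the caller's list
    if tools:
        # Breakpoint 1: a marker on the last tool covers all tool definitions.
        tools[-1]["cache_control"] = CACHE
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            # Breakpoint 2: the system prompt.
            {"type": "text", "text": system_prompt, "cache_control": CACHE},
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    # Breakpoint 3: the retrieved context.
                    {"type": "text", "text": rag_context, "cache_control": CACHE},
                    # Dynamic tail: no marker, never cached.
                    {"type": "text", "text": user_message},
                ],
            }
        ],
    }


def count_breakpoints(payload):
    """Count blocks carrying cache_control, to stay within the limit of 4."""
    blocks = payload["tools"] + payload["system"]
    for message in payload["messages"]:
        blocks += message["content"]
    return sum(1 for block in blocks if "cache_control" in block)


if __name__ == "__main__":
    example_tools = [
        {
            "name": "get_weather",  # made-up tool for the example
            "description": "Look up the weather for a city.",
            "input_schema": {"type": "object", "properties": {}},
        }
    ]
    payload = build_payload(example_tools, "You are...", "docs...", "Hi")
    print(count_breakpoints(payload))
```

Separate breakpoints mean that swapping the RAG context still leaves the tools and system prompt readable from cache.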

The default TTL is 5 minutes, refreshed on each hit. For RAG endpoints where the context is reused over hours, set cache_control: { type: "ephemeral", ttl: "1h" }. The write is more expensive (2× input instead of 1.25×), but the traffic needed to come out ahead drops to just over 2 requests per hour, so the third request in the window is the first clear win.
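The break-even count follows directly from the write and read multipliers. The sketch below counts only the cached prefix, since user and output tokens cost the same either way, and assumes all calls land inside one cache window; first_profitable_call is an illustrative name.

```python
def first_profitable_call(write_multiplier: float, read_multiplier: float = 0.10) -> int:
    """Smallest number of calls at which caching beats paying full price.

    Costs are in units of one uncached read of the prefix:
    cached   = write_multiplier + read_multiplier * (n - 1)
    uncached = n
    """
    n = 1
    while write_multiplier + read_multiplier * (n - 1) >= n:
        n += 1
    return n


if __name__ == "__main__":
    print("5-minute TTL:", first_profitable_call(1.25))
    print("1-hour TTL:  ", first_profitable_call(2.0))
```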

OpenAI: automatic caching above 1024 tokens

OpenAI's prompt caching is automatic — no marker needed. Any prompt of 1,024 tokens or longer that matches a recently-seen prefix gets a discount of roughly 50% on the cached portion. Cache writes are free (no extra cost over the base input price), and there is no TTL control — the cache lives for some provider-determined window (typically minutes, sometimes longer for very common prefixes).

The implication for prompt structure is the same as Anthropic: put stable content first, dynamic content last, keep the prefix above 1,024 tokens. The implication for cost: less savings per cached token than Anthropic, but no per-call ceremony.
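Because nothing in the request tells you whether the cache fired, the response's usage data is the place to look. A minimal sketch, assuming a Chat Completions usage object already converted to a plain dict (for example with the SDK's model_dump()), where cached tokens are reported under prompt_tokens_details.cached_tokens; cached_fraction is a hypothetical helper.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens that were served from the cache (0.0 to 1.0)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if not prompt_tokens:
        return 0.0
    return cached_tokens / prompt_tokens


if __name__ == "__main__":
    # Hand-written sample in the shape described above, not real API output.
    sample = {
        "prompt_tokens": 2048,
        "completion_tokens": 300,
        "prompt_tokens_details": {"cached_tokens": 1536},
    }
    print(f"{cached_fraction(sample):.0%} of the prompt came from cache")
```

A fraction stuck at zero on a prompt you expected to repeat usually means something near the top of the prompt is changing between calls.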

Gemini: explicit context caching

Gemini Context Caching is explicit and uses a separate API: you upload the stable context to the cache, get back a cache ID, then reference that ID in subsequent requests. Cached tokens cost roughly 25% of base input.

python
from google import genai

client = genai.Client()
cached = client.caches.create(
    model="gemini-2.5-pro",
    config={"contents": ["{{long_stable_context}}"], "ttl": "3600s"},
)
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["{{dynamic_user_message}}"],
    config={"cached_content": cached.name},
)

The trade-off vs Anthropic: Gemini's separate cache object is better for long-lived, infrequently-changing context (a whole legal document, a manual). Anthropic's inline cache_control markers, with a short TTL that refreshes on each hit, are better for high-frequency reads where the TTL lines up with your traffic pattern.

Code: JavaScript and Python with Anthropic SDK

JavaScript:

javascript
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const SYSTEM_PROMPT = `You are a helpful technical writer...
(thousands of tokens of stable instructions and examples)
`;

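// fetchKnowledgeBase() stands in for your own loader. It must return
// identical text on every call, or the cached prefix will never match.
const RAG_CONTEXT = await fetchKnowledgeBase();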

async function ask(userMessage) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: RAG_CONTEXT,
            cache_control: { type: "ephemeral" },
          },
          { type: "text", text: userMessage },
        ],
      },
    ],
  });
}

Two cache breakpoints: one on the system prompt, one on the RAG context. Both refresh on hit. Total cached content can be tens of thousands of tokens.

Python:

python
from anthropic import Anthropic
client = Anthropic()

SYSTEM_PROMPT = """You are a helpful technical writer..."""
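# fetch_knowledge_base() stands in for your own loader. It must return
# identical text on every call, or the cached prefix will never match.
RAG_CONTEXT = fetch_knowledge_base()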

def ask(user_message: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": RAG_CONTEXT, "cache_control": {"type": "ephemeral"}},
                    {"type": "text", "text": user_message},
                ],
            }
        ],
    )

Check the usage field on the response: cache_creation_input_tokens tells you what was written, cache_read_input_tokens tells you what was read from cache. Watch these in your monitoring to confirm caching is actually firing.
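One way to turn those fields into a monitoring signal is a small summariser. The sketch below assumes the three counters on usage (input_tokens, cache_creation_input_tokens, cache_read_input_tokens) add up to the total input for the call, and uses a stand-in object instead of a live response; summarize_cache_usage is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_cache_usage(usage) -> dict:
    """Classify a response's usage as a cache hit, write, or miss."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = written + read + uncached

    if read:
        status = "hit"      # at least part of the prefix came from cache
    elif written:
        status = "write"    # prefix was stored for later calls
    else:
        status = "uncached"  # caching did not fire at all

    return {
        "status": status,
        "total_input_tokens": total,
        "hit_ratio": read / total if total else 0.0,
    }


if __name__ == "__main__":
    # Stand-in for response.usage; the numbers are invented for the example.
    usage = SimpleNamespace(
        input_tokens=200,
        cache_creation_input_tokens=0,
        cache_read_input_tokens=10_000,
    )
    print(summarize_cache_usage(usage))
```

Logging the status per request makes a silent drop from "hit" to "write" easy to spot on a dashboard.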

The 5-minute vs 1-hour TTL trade-off

Anthropic offers two TTLs: 5-minute (default) and 1-hour.

  • 5-minute TTL: write cost is 1.25× input (no extra). Hit cost is 10% of input. Best for high-frequency endpoints where the same prefix is hit every few seconds — chat sessions, batch processing pipelines, tightly-scoped agents.
  • 1-hour TTL: write cost is 2× input. Hit cost is 10% of input. Best for medium-frequency endpoints where the prefix is hit a few times per hour — RAG over a slowly-changing knowledge base, daily-batch workflows, low-traffic chatbots.

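The dividing line between the two is the 5-minute window itself. If the same prefix arrives at least once every 5 minutes, each hit refreshes the default cache, it never expires, and the cheaper 1.25× write is all you need. If the gaps between calls run longer than 5 minutes but shorter than an hour, every call under the 5-minute TTL becomes a fresh 1.25× write, while the 1-hour TTL pays 2× once and 10% on each later hit, so it is cheaper from the second call onward. If calls arrive more than an hour apart, neither TTL helps.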
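To sanity-check a TTL choice against your own traffic, you can model the prefix cost for evenly spaced calls. This is a simplified sketch: it assumes perfectly regular gaps, that a hit refreshes either TTL, and that a call arriving after the TTL has elapsed is a full write. Costs are in units of one uncached read of the prefix, and the function names are illustrative.

```python
# TTL name -> (lifetime in minutes, write multiplier), as quoted above.
TTLS = {"5m": (5, 1.25), "1h": (60, 2.0)}
READ = 0.10  # cache hit multiplier


def calls_in_window(gap_minutes: int, window_minutes: int) -> int:
    """Calls at t=0, gap, 2*gap, ... up to the end of the window."""
    return window_minutes // gap_minutes + 1


def prefix_cost(gap_minutes: int, window_minutes: int, ttl: str) -> float:
    """Prefix cost over the window for evenly spaced calls under one TTL."""
    lifetime, write = TTLS[ttl]
    calls = calls_in_window(gap_minutes, window_minutes)
    if gap_minutes < lifetime:
        # Cache stays warm: one write, then hits.
        return write + READ * (calls - 1)
    # Cache expires between calls: every call is a write.
    return write * calls


if __name__ == "__main__":
    window = 8 * 60  # an eight-hour working day
    for gap in (2, 10, 30, 90):
        uncached = calls_in_window(gap, window)
        print(
            f"gap {gap:>2} min: no cache {uncached:>6.2f}, "
            f"5m {prefix_cost(gap, window, '5m'):>6.2f}, "
            f"1h {prefix_cost(gap, window, '1h'):>6.2f}"
        )
```

Real traffic is bursty, so treat the output as a first approximation and confirm with the usage counters from production.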

Common pitfalls

  • Cache misses you cannot explain. Almost always a non-deterministic detail at the start of the prompt: timestamps, randomised IDs, user-specific personalisation. Check the prefix byte-for-byte across two consecutive calls.
  • Caching a prefix that contains the user message. If the user message changes per call, no two calls share that prefix and the cache writes accumulate without ever being read. Move dynamic content out of the cached region.
  • Forgetting to cache tool definitions. If your prompt has a long list of tools in the system prompt or tool-definitions section, that is one of the biggest savings opportunities — but only if the tool list is stable across calls. Define tools once at the top, before any dynamic content.
  • Cache writes counted as savings. A cache write is more expensive than the no-cache baseline. If your traffic pattern is "one call per user with a new prefix every time", caching adds cost rather than removing it. Only enable caching when you have at least 2 calls per TTL window using the same prefix.
  • Mixing models with the same cached content. Cache is per-model. Switching from claude-sonnet-4-6 to claude-opus-4-7 invalidates everything; the new model rewrites the cache from scratch.
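For the byte-for-byte check in the first pitfall, a tiny diff helper saves a lot of squinting. The sketch below compares the cached region of two prompts and reports where they first diverge; first_divergence is a hypothetical helper, and you would feed it the exact strings your code sent on two consecutive calls.

```python
def first_divergence(a: str, b: str, context: int = 20):
    """Return (index, snippet_a, snippet_b) at the first difference, or None.

    The snippets show up to `context` characters on each side of the
    divergence so the offending field is easy to recognise.
    """
    limit = min(len(a), len(b))
    index = next((i for i in range(limit) if a[i] != b[i]), None)
    if index is None:
        if len(a) == len(b):
            return None  # identical prefixes: the miss is caused elsewhere
        index = limit    # one string is a strict prefix of the other
    start = max(0, index - context)
    return index, a[start:index + context], b[start:index + context]


if __name__ == "__main__":
    first = "Rules v1. Time: 10:00. Answer politely."
    second = "Rules v1. Time: 10:01. Answer politely."
    print(first_divergence(first, second))
```

If the reported index is near the start of the prompt, everything after it was uncacheable on that call.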

What to do next

External references: the Anthropic prompt caching documentation is the canonical reference. OpenAI's automatic prompt caching announcement explains the 1,024-token threshold. Gemini context caching docs cover the explicit-cache pattern.


Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.
