The AI bill that surprised half our clients last quarter
Three of our clients hit Claude or GPT bills they didn't budget for in Q1. Not because traffic spiked. Because their prompts got longer, their context windows fattened up, and nobody had turned on prompt caching. One was burning $4,200 a month on a chatbot that could have run on $1,300 with a five-line config change. The team had read the Anthropic docs, nodded, and moved on to "more important" tickets.
That gap is the story of 2026's quiet AI cost win. Prompt caching exists. Anthropic ships it as an opt-in flag, OpenAI turns it on automatically above a token threshold, and the client libraries are mature.
And yet most teams we audit haven't touched it. They keep paying full price for tokens they're sending again and again, request after request, while the cost dashboard slowly bleeds.
If your AI bill is climbing faster than your usage, this is probably why. The good news is that fixing it usually takes a focused afternoon, not a re-architecture sprint.
What prompt caching actually does
Prompt caching lets you mark portions of a prompt as cacheable. The provider stores the processed version of those tokens for a short window, usually five minutes for an ephemeral cache, with longer durations available at a higher write price. When the next request arrives with the same cached portion, you skip the cost of reprocessing those tokens, and you also skip a noticeable chunk of latency.
Anthropic's cache reads cost roughly 10% of standard input pricing. OpenAI's automatic input-token caching, rolled out in late 2024 and refined since, gives you a 50% discount on cached input tokens with no config required (it kicks in once a prompt crosses 1,024 tokens). Both options give you cheaper repeat calls when the front of your prompt is stable.
The "stable front" matters. Caching only works if the same tokens appear at the same position across requests. System instructions, tool schemas, retrieval-augmented context, and large documents you're answering questions about: these are all stable.
User questions at the end of the prompt are not stable, but they don't need to be cached. They're tiny.
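In practice the change really is small. Here's a minimal sketch with the Anthropic Python SDK; the model id and prompt text are placeholders, and note that Anthropic enforces a minimum cacheable prefix length (about 1,024 tokens on most models), so a short system prompt won't cache at all:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stable material: system instructions, reference docs, tool schemas.
# Placeholder here; in production this is the several-thousand-token prefix.
STABLE_SYSTEM_PROMPT = "You are a support agent for Acme. <policy text...>"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to this breakpoint is cached (~5-minute TTL).
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The volatile part, the user's question, stays outside the cached prefix.
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```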
Why most teams skip it, and the numbers we've seen in production
Honestly, it's a mix of three things.
First, AI cost still feels small until it isn't. A team running 200 prompts a day at $0.01 each doesn't notice $60 a month. The same team three quarters later, with longer context and 10x the volume, is suddenly paying $4,000 and panicking. The cost curve isn't linear, and caching helps most exactly when bills start to bite.
Second, the docs put the feature behind a flag. Anthropic asks you to add a cache_control field to the content block you want cached. OpenAI's caching is automatic but only kicks in when prompts exceed 1,024 tokens. Either way you have to read the docs, ship a small change, then verify it actually fires. For an under-staffed product team, this lives at the bottom of the backlog forever.
Third, and this is the contrarian piece: most engineering blogs covering AI infrastructure obsess over fine-tuning, RAG architecture, and vector databases. Prompt caching gets one paragraph in a 3,000-word article. It feels boring. It's also where the savings actually are for the 80% of production workloads that don't need exotic infrastructure.
Across six AI-heavy client projects we've audited or built since late 2025, here's what turning on caching changed:
| Workload | Cache hit rate | Bill change |
|---|---|---|
| Customer support chatbot (long system prompt) | 72% | -58% |
| Document Q&A (RAG with 30-page PDFs) | 61% | -49% |
| Code review agent (full repo context) | 84% | -66% |
| Sales email drafter (short prompts) | 22% | -9% |
| Healthcare triage assistant (large guidelines) | 78% | -61% |
| Marketing copy generator (variable inputs) | 14% | -4% |
The pattern is simple. The more stable context you carry, the more caching helps. Workloads where every prompt is bespoke barely benefit. Workloads with big shared system prompts or large reference documents win hard. We covered the broader cost picture in our 2026 AI development cost breakdown, but prompt caching is the single biggest knob most teams haven't turned.
Where teams get prompt caching wrong
We've watched smart teams burn the savings back. Four patterns to avoid.
Putting the cacheable content at the end. The prompt has to be cached from the front. If your system instructions sit after the user's question, you cache nothing. Move the stable parts up. We've seen this on at least two production codebases where the original prompt was assembled in random order from a config object.
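The fix is mechanical. Here's a sketch of the assembly order we push teams toward (the function and parameter names are ours, not from any client codebase):

```python
def build_request(system_prompt: str, reference_docs: list[str], user_question: str) -> dict:
    """Assemble kwargs for an Anthropic messages.create call, stable-first.

    The cacheable prefix must be byte-for-byte identical across requests,
    so the stable blocks always go first, in a fixed order.
    """
    system_blocks = [{"type": "text", "text": system_prompt}]
    system_blocks += [{"type": "text", "text": doc} for doc in reference_docs]
    # Breakpoint after the last stable block: everything above it is cached.
    system_blocks[-1]["cache_control"] = {"type": "ephemeral"}

    return {
        "system": system_blocks,
        # The user's question comes last, outside the cached region.
        "messages": [{"role": "user", "content": user_question}],
    }
```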
Letting the cache window expire by mistake. Anthropic's ephemeral cache lives for five minutes, refreshed on each hit. If your traffic is spiky, you'll cold-start the cache over and over. We solved this for one fintech client by adding a low-priority warming call every 4 minutes that keeps the cache hot for the burst that follows. The warming call ran on an internal Lambda, billed maybe $3 a month, and saved them roughly $900.
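A minimal version of that warming call, assuming the same stable prefix as the sketch above (the model id and schedule are illustrative; the client's ran on a Lambda cron):

```python
import anthropic

client = anthropic.Anthropic()

def warm_cache(stable_system_prompt: str) -> None:
    """Keep-alive ping: a cache hit refreshes the 5-minute TTL.

    Run this every ~4 minutes during expected traffic gaps. The prefix must
    match production byte for byte, max_tokens=1 keeps output cost trivial,
    and the cached input bills at the discounted read rate.
    """
    client.messages.create(
        model="claude-sonnet-4-20250514",  # must match the production model
        max_tokens=1,
        system=[{
            "type": "text",
            "text": stable_system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "ping"}],
    )
```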
Caching dynamic context. If you're stuffing the user's most recent message into the cached block, you're invalidating the cache every request. Rule of thumb: cache anything older than the current turn. The boundary line should be obvious in your prompt template; if it isn't, refactor before you cache.
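For multi-turn chat, that boundary lands naturally at the end of the conversation history. A sketch, assuming Anthropic-style message dicts:

```python
def mark_history_cacheable(history: list[dict]) -> list[dict]:
    """Put the cache breakpoint after the previous turn, never on the current one.

    Append the new user message *after* calling this, so each request caches
    everything older than the current turn.
    """
    if not history:
        return history

    # Clear breakpoints from earlier turns first (Anthropic allows at most
    # four cache_control markers per request).
    for msg in history:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)

    last = history[-1]
    # Normalize plain-string content into block form so cache_control can attach.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return history
```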
Trusting hit rates without measuring. Anthropic returns cache_creation_input_tokens and cache_read_input_tokens in the response. Log them. We've seen teams "enable" caching and never check whether it actually fires, then discover three months later that a prompt template change broke the cache prefix. A single trailing whitespace tweak can turn an 80% hit rate into 0% overnight.
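Logging those two fields is a few lines. A sketch; the logger setup is ours and the response object comes from the earlier examples:

```python
import logging

logger = logging.getLogger("llm")

def log_cache_usage(response) -> None:
    """Emit cache counters per call so dashboards and alerts can watch them."""
    u = response.usage
    logger.info(
        "llm_call input=%s cache_write=%s cache_read=%s",
        u.input_tokens,
        u.cache_creation_input_tokens,
        u.cache_read_input_tokens,
    )
    # cache_read stuck at zero across many calls usually means a template
    # change broke the prefix; alert on it rather than waiting for the bill.
```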
How SMEs, startups, and engineering leaders should ship this
If you're an SME running a single AI feature, this is the lowest-effort cost win you have. A senior engineer can implement and verify it in under a day. The savings compound monthly with no further work, and the only ongoing maintenance is making sure prompt-template changes don't quietly break the cache prefix.
If you're a startup founder watching runway, this is mandatory before your seed-to-Series-A traffic ramp. We've seen seed-stage teams quote 18-month runway, then watch AI bills eat 4 months of it after a Product Hunt launch. It shouldn't be your first feature, but it should ship before your first big launch. The marginal cost of getting it right at the start is hours; the cost of retrofitting at scale is weeks.
If you're a CTO or VP of Engineering, this is also a vendor-risk question. Anthropic and OpenAI both offer caching, but the implementations and pricing differ. If you're locked to one provider, switching costs include re-engineering caching. We've helped clients build provider-agnostic AI API layers exactly so this stays a flag flip, not a rewrite. Treat the caching configuration as part of your AI vendor abstraction, not as provider-specific code scattered through controllers.
And if you're a developer, the practical move is small: make cache hit rate a first-class metric in your dashboards, alongside latency and error rate. OpenAI's caching docs show how to read the breakdown from the response. Once it's a metric, the team will optimize for it. Things you don't measure don't improve.
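With OpenAI's chat completions, the counter lives in usage.prompt_tokens_details. A sketch; the system prompt here is a placeholder and must exceed 1,024 tokens before caching kicks in:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "<stable instructions, over 1,024 tokens...>"  # placeholder

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model id
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)

details = resp.usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
total = resp.usage.prompt_tokens
print(f"cached {cached} of {total} input tokens ({cached / total:.0%})")
```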
For new AI projects we ship at our AI engineering practice, prompt caching goes on the day-one checklist, not the optimization sprint. The cost difference compounds, and rebuilding the prompt structure later is harder than starting with it cached-ready. For audits of existing AI features, we usually find one or two structural issues: dynamic content placed at the front of the prompt, system instructions split across multiple messages, or a tool definition whose hash changes every deploy. Each fixable in a focused half-day. We've seen teams cut a $5,000 monthly bill to $1,800 with three small refactors and a cache_control header.
If you want a starting point, look at three things this week:
- Pull your last 1,000 production calls and find the median input token count. If it's over 2,000, caching almost certainly helps.
- Check whether your system prompt sits at the start of every request, identical character for character. Whitespace and timestamps in the system prompt break caching invisibly.
- Read the cache-related fields in your provider's response. If you're not parsing them, you're flying blind. The sketch below covers the first and third checks in one pass.
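A throwaway audit script, assuming you log per-call usage as JSONL with the field names from the logging sketch earlier (the file path is illustrative):

```python
import json
import statistics

def audit(log_path: str) -> None:
    """Median input size and token-level cache hit rate from a usage log."""
    totals, reads = [], []
    with open(log_path) as f:
        for line in f:
            row = json.loads(line)
            read = row.get("cache_read_input_tokens", 0)
            write = row.get("cache_creation_input_tokens", 0)
            # Anthropic's input_tokens excludes cached tokens, so add them
            # back to get the true prompt size.
            totals.append(row.get("input_tokens", 0) + read + write)
            reads.append(read)

    print(f"median input tokens: {statistics.median(totals):.0f}")
    print(f"token-level cache hit rate: {sum(reads) / max(sum(totals), 1):.0%}")

audit("llm_calls.jsonl")
```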
Frequently Asked Questions
Is prompt caching free?
Cache reads are heavily discounted but not free: typically 10% of the input price on Anthropic, 50% on OpenAI. On Anthropic, cache writes (the first call) cost 25% more than uncached input; OpenAI charges nothing extra for writes. The economics work out positive once your cache hit rate clears roughly 30%, which is easy on most production workloads with stable system prompts.
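For Anthropic's five-minute cache, the break-even falls out of those multipliers (1.25x for writes, 0.1x for reads). A back-of-envelope sketch; real workloads sit higher because not every token is behind the breakpoint, which is why we quote ~30% as the safe threshold:

```python
# Per cacheable token: a hit saves 0.9x the base input price (1.0 - 0.1),
# a miss that rewrites the cache costs an extra 0.25x (1.25 - 1.0).
READ_SAVING = 0.90
WRITE_PREMIUM = 0.25

break_even = WRITE_PREMIUM / (WRITE_PREMIUM + READ_SAVING)
print(f"break-even hit rate: {break_even:.0%}")  # ~22% on these multipliers
```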
How long does the cache last?
Anthropic's ephemeral cache lasts 5 minutes by default, refreshed on each read, with a one-hour option available at a higher write price. OpenAI's automatic cache typically lasts 5 to 10 minutes, also refreshed on each hit. Plan for cold starts after low-traffic periods, or warm the cache with low-priority pings if your traffic is spiky.
Does prompt caching affect output quality?
No. Caching stores the processed input state, not the output. The model still generates fresh tokens for every request, and its behavior is unchanged: given the same final prompt and sampling parameters, you get the same distribution of outputs with or without caching.
Can I cache only part of a prompt?
Yes. Anthropic caches everything from the start of the prompt up to the cache_control breakpoint you mark; OpenAI automatically caches the longest previously seen prefix. Either way, stable system instructions, tool schemas, and reference documents go first; user-specific content goes after. Most production prompts have a natural seam where this split works cleanly.
What if I'm using a wrapper like LangChain or a managed service?
Most modern wrappers expose cache control as a parameter. Check your wrapper's release notes. Support for this feature shipped between mid-2024 and early 2025 across the major libraries. If your wrapper hides it, the savings are hidden too, and that's a reason to consider switching to one that exposes the underlying provider features.
Final take
Prompt caching is the unglamorous part of running AI in production. It doesn't make a demo look better and won't impress a board deck. It will quietly cut 30 to 60 percent off your inference bill on most workloads, and it does so with a configuration change, not a re-architecture. We'd argue this is the single most underused tool in the 2026 AI engineering toolkit, and the longer your team waits, the more compounded waste you're sitting on.
If you're not sure whether your AI features are leaking money this way, that's worth a conversation. Book a discovery call with our team and we'll review your prompt structure with you. If you want broader context on where the stack is moving, our take on how Claude is reshaping IT delivery covers the bigger picture.