Prompt Caching Cut Our LLM Bill by 60%. Most Teams Still Don't Know It Exists.


Your LLM API bill doubled last month. Your team's response: count the tokens more carefully.

That's like watching your water bill spike and grabbing a smaller glass.

The real cost driver in production LLM systems isn't how many tokens you use. It's whether the infrastructure has to recompute the same context from scratch every single time — or whether it can skip the work it already did. This is the KV cache. And for most teams running LLMs in production, it's the thing nobody is watching.

The Metric You're Optimizing Isn't the One Burning Your Budget

Every LLM team I've seen tracks token count as the primary cost lever. Makes sense — the pricing page shows input/output tokens, so that's what gets measured. But token count is a billing abstraction. Underneath it, the actual cost is compute.

When a model processes a 50K-token context window — your system prompt, conversation history, retrieved documents — it performs a specific calculation: attention across every token pair in that sequence. That attention matrix grows quadratically with sequence length. The attention work for a 50K-token context isn't 10x that of a 5K-token one; it's closer to 100x.

The mechanism is the key-value (KV) cache. Every transformer model builds this structure during inference — a memory object that stores intermediate computations for each token in the sequence. When you send a new request with the same prefix (same system prompt, same retrieved document), the model has two options: reuse the cached computations from last time, or recompute everything from scratch.

Without prompt caching: recompute from scratch, every request.

With prompt caching: reuse what's already computed, pay only for the new tokens.
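To make the prefix-matching requirement concrete, here's a deliberately simplified toy in Python. Real serving stacks cache per-layer key/value tensors on the accelerator and match prefixes at the token level; the names and structure below are illustrative only, not any provider's implementation.

```python
# Toy illustration of prefix-based cache reuse (not a real serving implementation).
# Real KV caches hold per-layer key/value tensors on the GPU; this only shows the
# contract: reuse happens when the prefix matches exactly, and any change to the
# prefix forces a full recompute.
cache: dict[str, str] = {}  # prefix text -> placeholder for precomputed KV state

def process(prefix: str, suffix: str) -> str:
    if prefix in cache:
        return f"cache hit: processed only {len(suffix)} new chars"
    cache[prefix] = f"<KV state for {len(prefix)} chars>"
    return f"cache miss: processed {len(prefix) + len(suffix)} chars from scratch"

SYSTEM = "You are a security analyst. <imagine 30K tokens of stable instructions here>"
print(process(SYSTEM, "Scan repo A"))                       # miss: first request pays in full
print(process(SYSTEM, "Scan repo B"))                       # hit: only the new suffix
print(process("Today is 2025-06-01. " + SYSTEM, "Scan A"))  # miss: timestamp changed the prefix
```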

Anthropic's pricing makes the delta concrete: cached input tokens cost roughly 90% less than uncached. For requests where a 30K-token system prompt precedes a 200-token user query, the economics change completely. You're paying full price on 0.7% of the work.
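A quick back-of-envelope sketch of that delta. The prices below are assumptions (Claude 3.5 Sonnet-class list pricing in USD per million input tokens at the time of writing); substitute your provider's current rates before relying on the output.

```python
# Cost per request: 30K-token stable prefix + 200-token variable query.
PRICE_INPUT = 3.00        # uncached input, $/MTok (assumed list price)
PRICE_CACHE_WRITE = 3.75  # first request, which writes the cache entry
PRICE_CACHE_READ = 0.30   # subsequent requests that hit the cache

PREFIX_TOKENS = 30_000    # stable system prompt + retrieved context
SUFFIX_TOKENS = 200       # variable user query

def cost(tokens, price_per_mtok):
    return tokens / 1_000_000 * price_per_mtok

uncached = cost(PREFIX_TOKENS + SUFFIX_TOKENS, PRICE_INPUT)
first_request = cost(PREFIX_TOKENS, PRICE_CACHE_WRITE) + cost(SUFFIX_TOKENS, PRICE_INPUT)
cache_hit = cost(PREFIX_TOKENS, PRICE_CACHE_READ) + cost(SUFFIX_TOKENS, PRICE_INPUT)

print(f"uncached:      ${uncached:.4f}")       # ~$0.0906
print(f"first request: ${first_request:.4f}")  # ~$0.1131 (the cache write costs a premium)
print(f"cache hit:     ${cache_hit:.4f}")      # ~$0.0096, roughly 90% cheaper per request
```

The first request actually costs slightly more than an uncached one, because cache writes carry a premium; the savings start from the second request onward.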

What ProjectDiscovery Actually Found

ProjectDiscovery builds security tools that process large code repositories inside their LLM pipeline. Their context windows ran 40K–80K tokens per request, with system prompts that stayed identical across calls.

In early 2026, they published the results of restructuring their prompts around caching: 59% cost reduction. Not on one type of request — across their total LLM spend.

The change wasn't a model swap, a compression trick, or a different provider. It was recognizing that their prompts had a stable prefix (system prompt and retrieved context) and a variable suffix (the user query and scan target). By moving stable parts to the front and marking the cache breakpoint explicitly, every subsequent request could reuse the expensive computation from the previous one.

The math isn't exotic. If your total input tokens split 80% stable prefix / 20% variable suffix, and the prefix is eligible for caching after the first request, you're paying full price on 20% of input tokens, a roughly 90%-discounted rate on the other 80%, and full price on output tokens as always. That's why real-world savings land nearer 60% than 80%, which is exactly the neighborhood of ProjectDiscovery's 59%. On a $5,000/month bill, that's roughly $3,000 staying in your account. On a $50,000/month bill, you've just funded two engineers.

Where Teams Get the Implementation Wrong

The pattern breaks in three predictable ways.

The prefix isn't actually stable. Prompt caching works by matching prefixes exactly. If your system prompt includes a timestamp, a request ID, or anything that changes per-call, the cache misses every time. Teams that add "Today is {date}" at the top of their system prompt — standard practice for temporal grounding — are invalidating the cache on every request. Fix: move dynamic elements to the suffix. Static instructions first. Variable context (date, user details, retrieved docs) after; the request sketch below shows the corrected ordering.

The context is too short to benefit. Anthropic requires a minimum of 1,024 tokens for cache eligibility (2,048 on the smaller Haiku models). If your prompts run shorter than this, caching doesn't apply. The threshold exists because the cost of storing a cache entry for a short prompt exceeds the savings from reusing it. If you're building with short prompts, the right lever isn't caching — it's batching and request volume reduction.

Teams skip the cache_control marker. Prompt caching isn't automatic on most providers. On the Anthropic API, you have to explicitly mark the cache breakpoint by setting "cache_control": {"type": "ephemeral"} on the last content block you want cached. Teams that read the pricing page but not the implementation guide miss this step entirely, implement nothing, and wonder why costs haven't changed.
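Here's what a correctly structured request can look like with the Anthropic Python SDK: stable instructions and retrieved context first, the cache breakpoint on the last stable block, and the per-request material in the user turn. The model name and prompt contents are placeholders; a prefix this short wouldn't actually clear the 1,024-token minimum, so treat it as a shape, not a drop-in.

```python
import anthropic

STATIC_INSTRUCTIONS = "You are a security analyst. <several thousand tokens of stable rules>"
RETRIEVED_DOCUMENTS = "<large retrieved context that is identical across requests>"
user_query = "Scan the uploaded repository for exposed credentials."
today = "2025-06-01"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whichever model you run
    max_tokens=1024,
    system=[
        # Stable prefix: identical across requests, so it is eligible for caching.
        {"type": "text", "text": STATIC_INSTRUCTIONS},
        {
            "type": "text",
            "text": RETRIEVED_DOCUMENTS,
            # Cache breakpoint: everything up to and including this block gets cached.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        # Variable suffix: the date and the query live after the breakpoint, so they
        # can change per request without invalidating the cached prefix.
        {"role": "user", "content": f"Today is {today}. {user_query}"},
    ],
)

# usage reports cache_creation_input_tokens / cache_read_input_tokens, so you can
# verify the prefix is actually being reused rather than rewritten each time.
print(response.usage)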

The Semantic Caching Layer

Prompt caching handles exact-match prefix reuse. Semantic caching handles approximate reuse — queries that are semantically similar but textually different.

The approach: before sending a new query to the LLM, check a vector database for previous queries with high semantic similarity. If the previous query was "summarize the Q3 report" and the new one is "give me a summary of Q3 results," they'll land in the same embedding neighborhood. Return the cached response. Skip the LLM call entirely.

For customer-facing applications with recurring question patterns, this layer can eliminate 30–40% of LLM calls outright. The implementation adds complexity — you need an embeddings pipeline and a vector store — but for high-traffic systems the tradeoff is straightforward. Redis with cosine similarity is the standard approach. Set a threshold around 0.95 to start (conservative; catches identical phrasings), monitor hit rates, and widen to 0.85 once you're confident the semantic space is well-distributed.
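A minimal in-memory sketch of that lookup, assuming you already have an embed() callable (any embedding model) and a call_llm() callable; a production version would back the store with Redis or another vector database rather than a Python list, but the decision logic is the same.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # start conservative; widen toward 0.85 as hit rates prove out

# (embedding vector, cached response) pairs; in production this lives in a vector store.
semantic_cache: list[tuple[np.ndarray, str]] = []

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(query: str, embed, call_llm) -> str:
    """Return a cached response for a semantically similar query, else call the LLM."""
    q_vec = embed(query)
    for vec, response in semantic_cache:
        if cosine(q_vec, vec) >= SIMILARITY_THRESHOLD:
            return response           # semantic hit: skip the LLM call entirely
    response = call_llm(query)        # miss: pay for the call, then remember it
    semantic_cache.append((q_vec, response))
    return response
```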

This is meaningfully different from KV caching. Prompt caching reduces compute cost per call. Semantic caching reduces the number of calls. Both levers apply to different parts of the cost structure. Most teams don't implement either; the ones doing this systematically use both.

The Memory Ceiling That Appears Later

There's a third cost that shows up once you scale, and most teams don't plan for it: cache storage.

KV cache is expensive to store. At the hardware level, a single 128K-token context held in KV cache consumes roughly 40GB of GPU memory for a 70B-class model. Not 40GB per user — 40GB per active context. For a system with 1,000 concurrent users each maintaining a long session context, memory becomes the constraint before compute does.
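The arithmetic behind that figure, as a sketch. The exact number depends on the model; the parameters below are assumptions that roughly match a 70B-class model with grouped-query attention at 16-bit precision.

```python
# KV cache size ≈ 2 (key + value) × layers × kv_heads × head_dim × bytes_per_value × tokens
LAYERS = 80       # transformer blocks (70B-class assumption)
KV_HEADS = 8      # grouped-query attention: far fewer KV heads than query heads
HEAD_DIM = 128
BYTES = 2         # fp16 / bf16
TOKENS = 128_000

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
total_gb = bytes_per_token * TOKENS / 1e9
print(f"{bytes_per_token / 1e6:.2f} MB per token, {total_gb:.0f} GB for a 128K context")
# -> ~0.33 MB per token, ~42 GB for a single 128K-token context
```

Older models without grouped-query attention multiply that per-token figure several times over, which is why long-context serving capacity is usually memory-bound.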

This is why production caching strategies aren't just about the cache_control parameter. They're about cache TTL management (how long to keep cached entries before expiring them), context window compression (can you summarize older conversation turns to shrink the prefix?), and session architecture (when to start a fresh context versus extending an existing one).

If you implement prompt caching, watch your cache hit rate, not just your bill. A 70%+ hit rate means your prompt structure is working. A 20% hit rate means your prefixes are varying more than you think — debug the construction logic before concluding caching doesn't apply to your use case.
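One way to measure it against the Anthropic API, as a sketch; the field names come from the usage object on Anthropic responses at the time of writing (input_tokens, cache_read_input_tokens, cache_creation_input_tokens), so adjust for your provider.

```python
def prefix_cache_hit_rate(usages) -> float:
    """Share of prompt tokens served from cache across a batch of API responses.

    Each `usage` is the usage object from an Anthropic Messages API response; with
    prompt caching enabled it reports uncached input tokens separately from tokens
    written to or read from the cache.
    """
    read = sum(u.cache_read_input_tokens or 0 for u in usages)
    written = sum(u.cache_creation_input_tokens or 0 for u in usages)
    uncached = sum(u.input_tokens for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0

# A rate above ~0.7 suggests the stable prefix is being reused as intended;
# a rate near 0.2 usually means something in the prefix changes per request.
```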

The metric you want is cache hit rate. The metric most teams track is token count. Context-window limits already constrain what you can send for quality reasons; the cost architecture is a separate constraint that runs in parallel, and it needs its own instrumentation.

Token count is a proxy. Hit rate is the signal.
