Why Faster LLM Inference Breaks Your Intuition

May 18, 2026

The GPU isn't compute-bound when your LLM generates text. It's waiting.

That sentence sounds wrong if you've seen the $30,000 price tag on a modern H100 and assumed the bottleneck is raw matrix-multiplication power. But the math is the cheap part. The expensive part, in time, in energy, and in tokens per second, is moving model weights from high-bandwidth memory (HBM) into the GPU's on-chip memory and registers. That's the memory-bandwidth bottleneck, and it's why LLM inference is slower than the hardware's headline FLOPS suggest.

Speculative decoding is the fix that doesn't make sense until it clicks. And when it clicks, it reframes how you think about building inference pipelines.

Why Autoregressive Generation Has a Hidden Ceiling

A language model generates one token at a time. This is called autoregressive decoding — each token depends on all previous tokens, so you can't parallelize across the output sequence. You run one forward pass, get one token, append it to the context, run another forward pass, get another token. Repeat until you hit the stop sequence or the length limit.

Every forward pass loads the full model weights from HBM. For a 70-billion-parameter model in float16 (2 bytes per parameter), that's roughly 140 GB streamed through the GPU per pass. A modern H100 has ~3.35 TB/s of HBM bandwidth, so even setting aside that 140 GB has to be sharded across several 80 GB cards, loading the weights at that rate takes about 42 ms per pass. During that time, the compute cores are mostly waiting.

The result: at the small batch sizes typical of interactive LLM inference, the memory system is saturated while the compute cores sit underused. You're memory-bandwidth-limited, not compute-limited. Moving to a GPU with more FLOPS doesn't fix this unless memory bandwidth grows along with it.
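The arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch, not a benchmark: the function name `decode_ceiling` is illustrative, it treats the 3.35 TB/s figure as the bandwidth available to the whole model, and it ignores KV-cache reads, activations, and multi-GPU communication.

```python
def decode_ceiling(params_billion: float, bytes_per_param: float, bandwidth_tb_s: float):
    """Best-case decode speed when every pass must stream all weights once.

    Assumption: weight traffic is the only cost (no KV cache, no compute time).
    """
    weight_gb = params_billion * bytes_per_param            # 1e9 params * bytes/param = GB
    seconds_per_pass = weight_gb / (bandwidth_tb_s * 1000)  # TB/s -> GB/s
    tokens_per_second = 1.0 / seconds_per_pass              # one token per pass at batch size 1
    return weight_gb, seconds_per_pass, tokens_per_second


for name, params in [("70B target", 70), ("7B draft", 7)]:
    gb, sec, tps = decode_ceiling(params, bytes_per_param=2, bandwidth_tb_s=3.35)
    print(f"{name}: {gb:.0f} GB of weights, {sec * 1000:.1f} ms per pass, ~{tps:.0f} tokens/s ceiling")
```

Under these assumptions, a single sequence on the 70B model tops out at roughly 24 tokens per second no matter how many FLOPS the chip advertises, while the 7B model's ceiling is about ten times higher.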

This is the foundational problem that speculative decoding attacks.

What Speculative Decoding Actually Does

The mechanism is simpler than the name suggests. You run two models instead of one:

A draft model — small, fast, cheap to run. Something like a 7B parameter model alongside your 70B main model.
A target model — the large, high-quality model you actually care about.

Here's the sequence:

The draft model generates K tokens autoregressively (e.g., K = 5).
You run a single forward pass of the target model — but this time, you run it on the full context including all K draft tokens simultaneously. Because transformer forward passes are parallel across the sequence, this costs roughly the same as one normal token-generation step.
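The target model scores each draft token. If a draft token matches what the target model would have produced at that position, you accept it. At the first mismatch, you discard that draft token and every draft token after it, and use the target model's own token for that position instead. If all K draft tokens are accepted, the same pass also yields one extra target token.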
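The three steps above can be sketched with toy models. This is a minimal illustration of the greedy (exact-match) variant only; the `NextToken` interface, `speculative_step`, and the two toy models are hypothetical names invented for this example, and the verification loop stands in for what a real system does in one batched target forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a model is a function from context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(context: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # Step 1: the draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Step 2: the target model predicts the token at every position.
    # A real implementation gets all k + 1 predictions from ONE forward pass;
    # the loop here only simulates that.
    target_preds = [target(context + proposed[:i]) for i in range(k + 1)]

    # Step 3: keep the longest matching prefix, then append the target's own
    # token (a correction at the first mismatch, or a bonus token if all matched).
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_preds[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_preds[len(accepted)])
    return accepted


def generate(context: List[int], draft: NextToken, target: NextToken, k: int, max_new: int) -> List[int]:
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(context) + max_new]


# Toy models: the target counts upward mod 10; the draft agrees except after a 3.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target, k=5))   # accepted prefix plus one correction
print(generate([0], toy_draft, toy_target, k=5, max_new=12))
```

Because every emitted token is either confirmed or supplied by the target, the output of `generate` is identical to decoding with the target model alone; only the number of target passes changes.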

The key insight: you can verify K tokens in parallel with one target model pass. If the draft model has even modest accuracy, say a 70-80% per-token acceptance rate, you get about 3-4 tokens per target model pass at K = 5 instead of 1. Before counting the draft model's own cost, that's 3-4x the tokens per target pass without changing model quality.

The target model's output distribution is preserved. The ICLR 2026 paper "Speculative Speculative Decoding" and UC Berkeley's technical report on speculative inference both confirm this — when you use the right rejection sampling scheme, speculative decoding is provably equivalent to sampling from the target model directly. You get speed without quality compromise.
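For sampling (as opposed to greedy decoding), one commonly described rejection scheme accepts a draft token x with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution, and on rejection resamples from the normalized positive part of p - q. The sketch below covers a single token position only; the function names and the example distributions are made up for illustration, and it is not taken from any of the works cited above.

```python
import random
from typing import List


def residual(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): where the target wants more mass than the draft gave."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p)


def speculative_sample(p: List[float], q: List[float], rng: random.Random) -> int:
    """Sample one token using draft distribution q, corrected to follow target p."""
    x = rng.choices(range(len(q)), weights=q)[0]    # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):        # accept with probability min(1, p/q)
        return x
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def exact_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Analytic distribution of speculative_sample's output, for checking against p."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)                 # probability that the proposal is rejected
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]


p = [0.5, 0.3, 0.2]   # illustrative target distribution
q = [0.2, 0.5, 0.3]   # illustrative draft distribution
print(exact_output_distribution(p, q))
print("acceptance probability:", sum(min(pi, qi) for pi, qi in zip(p, q)))
print("one sample:", speculative_sample(p, q, random.Random(0)))
```

The analytic check reproduces p, which is the sense in which the output distribution is preserved. The acceptance probability is the overlap between the two distributions, so a draft that tracks the target more closely is accepted more often.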

The Counterintuitive Math

Running two models seems like it should be slower. You're doing more work, not less. This is where the memory-bandwidth framing becomes essential.

The draft model is so much smaller that its per-token cost is negligible. If the draft model is 7B parameters and the target is 70B, the draft costs roughly 1/10th the compute and memory bandwidth. So running 5 draft tokens costs about the same as 0.5 target model passes.

Meanwhile, verifying 5 tokens in one target model pass costs roughly the same as generating 1 token. Processing K positions in parallel isn't K times as expensive, because the pass is bounded by the memory load of the weights, which you'd pay once either way.
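A rough way to see why verification is nearly free is to compare the time to stream the weights with the time to do the matrix math for a handful of positions. The sketch below assumes about 2 FLOPs per parameter per token and a dense FP16 peak of 989 TFLOPS for an H100; both are assumptions for illustration, attention and KV-cache costs are ignored, and real kernels do not reach peak.

```python
PEAK_FLOPS = 989e12     # assumed dense FP16 tensor-core peak, FLOPs per second
BANDWIDTH = 3.35e12     # HBM bandwidth in bytes per second, from the text
PARAMS = 70e9           # target model parameters
BYTES_PER_PARAM = 2     # float16


def pass_times(num_positions: int):
    """Return (memory_seconds, compute_seconds) for one target pass over num_positions tokens."""
    memory_s = PARAMS * BYTES_PER_PARAM / BANDWIDTH       # weights are streamed once per pass
    compute_s = 2 * PARAMS * num_positions / PEAK_FLOPS   # ~2 FLOPs per parameter per token
    return memory_s, compute_s


def break_even_positions() -> float:
    """Positions per pass at which compute time catches up with weight loading."""
    memory_s, compute_one = pass_times(1)
    return memory_s / compute_one


for n in (1, 6, 64):
    mem, comp = pass_times(n)
    print(f"{n:>3} positions: memory {mem * 1e3:.1f} ms, compute {comp * 1e3:.2f} ms")
print(f"break-even at ~{break_even_positions():.0f} positions per pass")
```

With these numbers, scoring 6 positions adds well under a millisecond of compute to a pass that spends about 42 ms loading weights. The same estimate also shows the limit: positions are counted across the whole batch, so a large batch uses up the spare compute that verification depends on.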

Net effect: each speculative step costs about 1.5 target forward passes (1 for verification plus ~0.5 for drafting) and yields 3-4 tokens instead of 1. That works out to roughly 2-2.7x, consistent with the 2-3x throughput multiplier measured in production deployments.
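The same estimate can be written as a small cost model. It assumes each draft token is accepted independently with probability `alpha`, which real traffic only approximates, and that a draft token costs a fixed fraction of a target pass; the function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: accepted prefix plus one target token.

    Assumes independent per-token acceptance with probability alpha, which gives
    the geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Throughput relative to plain decoding, with cost measured in target passes."""
    step_cost = 1.0 + k * draft_cost_ratio   # one verification pass plus k draft tokens
    return expected_tokens_per_step(alpha, k) / step_cost


for alpha in (0.7, 0.8):
    tokens = expected_tokens_per_step(alpha, k=5)
    print(f"alpha={alpha}: {tokens:.2f} tokens per step, {speedup(alpha, 5, 0.1):.2f}x speedup")
```

For K = 5 and a draft that costs a tenth of the target, this gives about 2.9 and 3.7 tokens per step at 70% and 80% acceptance, and speedups of roughly 2.0x and 2.5x once the draft cost is included.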

Why Draft Model Selection Is a First-Class Design Decision

Most teams implementing speculative decoding treat it as a serving-layer optimization — swap in the right inference library, tune K, done. That's underusing the concept.

The draft model acceptance rate drives everything. At K = 5, a draft model with 90% acceptance yields 4-5 tokens per target pass, which is roughly a 3x speedup once the draft's own cost is counted. A draft model at 60% acceptance delivers 1.5-2x. The difference between those outcomes is which draft model you pick, and picking well requires understanding your query distribution.

For a coding assistant, a code-specialized small model will dramatically outperform a general-purpose one as your draft. For a customer service application where most responses follow predictable patterns, you can tune a small model specifically on your response distribution and get near-optimal acceptance rates.

The frameworks diverge here. vLLM, TensorRT-LLM, and SGLang all implement speculative decoding differently, with different default draft model integrations. This isn't a detail — the choice of draft architecture relative to your serving framework is where real-world speedups either materialize or don't.

What This Changes About Inference Architecture

The practical consequence: token throughput is no longer only a function of hardware. It's a function of how well your draft model predicts your target model's outputs on your specific traffic.

This creates a new optimization surface that didn't exist in the pure autoregressive world. You can:

Fine-tune draft models on domain-specific data to improve acceptance rate on your query distribution
Tune K (the number of draft tokens per step) to balance throughput against your latency budget
Run multiple speculative heads in parallel on some architectures, sampling from different draft distributions
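As a sketch of what tuning K involves, the snippet below picks the K that maximizes throughput under a simple cost model: independent per-token acceptance with probability `alpha`, and a draft token costing a fixed fraction of a target pass. Both are simplifying assumptions, the names are illustrative, and latency is not modeled, so treat the result as a starting point for measurement rather than an answer.

```python
def speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Expected tokens per step divided by step cost, in units of target passes."""
    expected_tokens = sum(alpha ** i for i in range(k + 1))   # 1 + alpha + ... + alpha**k
    return expected_tokens / (1.0 + k * draft_cost_ratio)


def best_k(alpha: float, draft_cost_ratio: float, k_max: int = 16) -> int:
    """Return the draft length in 1..k_max with the highest modeled throughput."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, draft_cost_ratio))


for alpha in (0.6, 0.8, 0.9):
    k = best_k(alpha, draft_cost_ratio=0.1)
    print(f"alpha={alpha}: best K={k}, modeled speedup {speedup(alpha, k, 0.1):.2f}x")
```

The pattern to notice is that the best K grows with acceptance rate: a weak draft wastes work on tokens that will be thrown away, so it pays to speculate less, while a strong draft rewards longer speculation.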

It also creates a new failure mode: if your query distribution drifts (e.g., you release a new product feature and users start asking about it), your draft model's acceptance rate will drop and your effective throughput will degrade. Standard monitoring won't explain it: GPU utilization stays the same, tokens per second drops, and the cause isn't obvious unless you're tracking speculative acceptance rate explicitly.
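Tracking acceptance rate explicitly can be as simple as a rolling window over recent speculative steps. The class below is a hypothetical sketch, not part of any serving framework: the name `AcceptanceMonitor`, the window size, and the thresholds are placeholders, and it assumes your serving stack can report proposed and accepted draft token counts per step.

```python
from collections import deque
from typing import Optional


class AcceptanceMonitor:
    """Rolling acceptance rate over the most recent speculative steps."""

    def __init__(self, window_steps: int = 1000, baseline: float = 0.8, max_drop: float = 0.1):
        self.window = deque(maxlen=window_steps)   # (proposed, accepted) pairs, oldest dropped first
        self.baseline = baseline                   # acceptance rate measured on known-good traffic
        self.max_drop = max_drop                   # tolerated absolute drop before flagging

    def record(self, proposed: int, accepted: int) -> None:
        self.window.append((proposed, accepted))

    def rate(self) -> Optional[float]:
        proposed = sum(p for p, _ in self.window)
        if proposed == 0:
            return None
        return sum(a for _, a in self.window) / proposed

    def degraded(self) -> bool:
        current = self.rate()
        return current is not None and current < self.baseline - self.max_drop


monitor = AcceptanceMonitor(window_steps=100, baseline=0.8, max_drop=0.1)
for _ in range(100):
    monitor.record(proposed=5, accepted=4)   # healthy traffic
print(monitor.rate(), monitor.degraded())
for _ in range(100):
    monitor.record(proposed=5, accepted=2)   # traffic the draft model predicts poorly
print(monitor.rate(), monitor.degraded())
```

Exporting this rate next to tokens per second makes a drift-induced slowdown attributable: throughput falls while GPU utilization holds steady, and the acceptance rate shows why.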

This connects to a broader pattern in production AI infrastructure — silent degradation is often worse than loud failure, because you don't know to look for it until something downstream breaks. If you're building on RAG pipelines, you've seen this with embedding drift; speculative decoding has an analogous version with draft model drift. Penn covered the RAG embedding drift problem — the monitoring mindset transfers.

The Next Bottleneck Is Already Here

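Speculative decoding addresses the memory-bandwidth bottleneck for a specific regime: inference at small-to-moderate batch sizes, where the target model is still memory-bound and the draft model can run efficiently alongside it. At large batch sizes the target pass becomes compute-bound, and the spare compute that verification relies on disappears.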

But as models get larger and verification itself becomes expensive — particularly with extended thinking models that run many forward passes to produce a single output — the calculus shifts. Verification latency starts to eat into the speculative gains.

The Berkeley research group identifies this as the open frontier: how do you speculate over chains of reasoning, not just individual tokens? The draft model needs to not just predict the next token but the next reasoning step. That requires draft models that can understand and approximate reasoning structure, which pushes into distillation territory rather than simple parameter-count reduction.

For now, if you're running LLM inference at scale and haven't evaluated speculative decoding, the case is straightforward: 2-3x throughput at equivalent quality, with the caveat that you need to pick your draft model thoughtfully and monitor acceptance rate as a first-class metric.

The GPU was never doing what you thought it was doing. Once you see that, the rest follows.
