Anton, chapter 10: Making Gemma fly

I left a line open in the spring: speculative decoding, maybe a 1.5x on the local model, compounding with the quantization work I'd just done. Gemma 4 has landed since, the model on the Spark serves in the high forties of tokens a second, and I keep telling myself that's the ceiling on this hardware. The question I've actually been chewing on is sharper than that: is it the ceiling, or is it just the ceiling of what I bothered to try? I decide to find out properly, with numbers, and not stop at the first comfortable answer.

The quantization dead end

The obvious lever is a better quantization, and there's a fresh one to chase: a quantization-aware-training release, weights trained to survive 4-bit. The theory is seductive, full quality at a quarter of the memory. I fan out a small army of research agents and they come back with the same verdict from five angles: it does nothing for us. The Spark is memory-bandwidth-bound, and any 4-bit format moves the same bytes per token; the number that decides everything is bandwidth, and no checkpoint changes it. Worse, the format the box already runs gets its storage win but not its compute win, because this silicon has no native 4-bit math. I'm carrying the cost of the format (a patched model file, a software-emulated kernel, an image that loads it only if you hold your mouth right) and collecting none of the payoff. Quantization was never the lever.

The spec sheet is not the benchmark

So I look at the machines I already own. My work Mac and an idle laptop both have more memory bandwidth than the Spark, and decode is bandwidth-bound, so on paper they should win. They lose, badly: the same model runs slower on twice the bandwidth, because Apple's mixture-of-experts kernels pull a third of what the datacenter stack does. Bandwidth sets the ceiling; the kernel decides where you land under it. That detour cost me a clean hypothesis and bought a real rule: the spec sheet is not the benchmark.

Multi-token prediction

The lever turns out to be the one I'd left open. Multi-token prediction: a tiny drafter guesses the next few tokens, the big model verifies them in a single pass, and on a bandwidth-bound box that is the only trick that beats the wall, because you pay for one weight read and get several tokens back. A community write-up claims triple the baseline on this exact hardware, and I want it. I match the recipe to the letter and land at half the headline. The gap is not a missing flag; I run their command verbatim and my baseline matches theirs to the decimal. The gap is what "tokens per second" even means. Their number is a harness measuring pure decode with the prefill stripped away; mine is what a caller actually waits for, wall clock, everything included. Both are true. They measure different things. The honest figure, the one a person feels, is a clean 1.4x to 1.9x depending on how predictable the text is, and I write that one down instead of the prettier one.

Two orthogonal levers

Then I test it the way Anton is really used: the actual router prompt, a real conversation, sixteen of them at once. The number that matters isn't single-stream at all. When someone is mid-conversation the whole prefix is already cached, and the answer comes back in half a second instead of five, ten times faster on the most common path there is. The win was never a single thing. It was two orthogonal levers, speculation and caching, neither of which has anything to do with the quantization I'd spent days circling.

Trying to break it

Last, I try to break it. Five hundred requests at once, and it just queues, never dropping one, the chip barely warm; the old warning about thermal shutdowns at high concurrency refuses to reproduce. So this is where it lands: the local model answers a live conversation in half a second, holds north of a thousand tokens a second across a crowd, and fails over to a second box if the first ever stumbles. The question I started with has an answer. The ceiling I'd accepted was the ceiling of one idea, and most of the climb past it was just measuring honestly enough to see where the floor actually was.