Anton, chapter 8: Local LLM optimization, NVFP4 Gemma on DGX Spark

The morning starts with everything broken. Anthropic returns 400 with "credit balance is too low" on every request, and because the sunny model group has no fallbacks configured, the error propagates straight back through LiteLLM as "I couldn't reach the language model. Something may be misconfigured." The heartbeat stops, scheduled jobs fail silently, every interactive chain dies on the first token. The actual cause is billing (a card issue, fixable in two clicks), but the fact that one provider's billing hiccup takes the whole assistant down is the real bug. The local Gemma is sitting on the same box, idle, ready to serve. It just isn't wired in as a fallback.

The plan writes itself: chain every paid provider down to local for survivability, then fix whatever's wrong with the local path so the chain actually works, then use the disruption as cover to do the LLM upgrade I've had in the back of my mind for two weeks. Three things, in that order, because the survivability fix has to land before I touch the running container.

A fallback chain

The first commit is the LiteLLM config. Eight model groups, zero fallback entries: a config shape that's been sitting there since the early days when there was no local model worth falling back to. I add explicit chains so every paid group degrades to gustav (the local Gemma), with the heavier groups going through an intermediate before the local stop. LiteLLM mounts the config as a volume, so a restart is needed for it to pick up. Five minutes of work. The latent gap I'd been carrying for months, closed.

First fallback test fails. Anthropic 400, LiteLLM tries gustav, vLLM responds with its own 400: "auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set. The container has never been launched with those flags. The local path was never exercised under tool-call traffic, so the missing flags were latent the whole time. This is the small lesson the morning hands me: a fallback that isn't routinely exercised isn't really a fallback. Schema drift hides in the paths nobody runs.

Fixing it is two flags and a parser name I don't know. Rather than hunt through release notes, I list the tool_parsers directory inside the running container. A gemma4_tool_parser.py and gemma4_reasoning_parser.py are sitting right there. Grep the container, not the docs: faster every time. I add --enable-auto-tool-choice, --tool-call-parser gemma4, --reasoning-parser gemma4. Tool-call smoke test through LiteLLM returns a proper structured tool_calls object. Fallback chain is functional end to end.

Now the system is at parity with where it was supposed to be all along. This is the moment I want to upgrade. And this is the moment I almost skip the most important thing.

Baseline before change

I'm about to start changing flags when I catch myself: I have no baseline. No number to compare against. If I jump straight to the upgrade and it gets faster, I won't know by how much; if it gets slower, I might not even notice. I run the benchmark first. Fixed prompt, 200 words of Paris history, 512 max tokens, temperature zero, three runs. 23.4 tok/s, dead steady across runs. That's the number I'm trying to beat. Benchmark first, change second, every time. The temptation to skip this step is strong specifically because the change feels obvious. That's exactly when discipline matters.

Then I read before I touch. Three things I confirm via web sources before staging anything. First, runtime FP8 quantization of Gemma 4 MoE is broken upstream; passing --quantization fp8 would crash the container on the fused MoE layer loader. There's an open issue tracking it. Off the table. Second, an NVFP4-quantized Gemma 4 checkpoint someone had published is up on HuggingFace, 16.5 GB across three shards, ready to pull. Third, the checkpoint requires a patched gemma4.py because of another open vLLM issue: the built-in expert_params_mapping doesn't handle NVFP4 scale key suffixes. The patch ships alongside the model weights as a sibling file. The upstream fix isn't merged yet, so the bind mount is necessary. There's also a published benchmark on the same hardware showing 52 tok/s as the achievable ceiling with the right flag set. That's my target.

Staging the upgrade

Staging happens with the BF16 container still serving. I pull the new vLLM image, snapshot-download the NVFP4 model (about three minutes), copy the patched gemma4.py out of the model directory to a host path I can bind-mount, and rewrite the launch script with the NVFP4 flags and a commented BF16 rollback block sitting right underneath. Stage everything before the disruption window: when the swap actually happens, it's just a container recreate, not a fifteen-minute scramble.

The flags that matter: --quantization modelopt to pick up the NVFP4 weights, --moe-backend marlin because the GB10's SM121 lacks native FP4 compute and MARLIN W4A16 is the software-emulated path that actually runs, --max-model-len 131072 for the full 128K native context, --gpu-memory-utilization 0.85, --max-num-seqs 16 sized to actual concurrency rather than a wishful default. The served model name stays google/gemma-4-26B-A4B-it so LiteLLM's config doesn't need a single edit. The bind mount overlays the patched model file onto the path vLLM loads.

Container swap takes about ninety seconds end to end with a warm disk cache. The log line I'm watching for arrives: Using 'MARLIN' NvFp4 MoE backend out of potential backends. MARLIN is selected, the patched loader is in play, the model is up.

Same benchmark, three runs: 43.5 tok/s (49.0, 37.5, 44.1). 1.86× over baseline. Weight memory drops from roughly 52 GB to 16.5 GB, a 68% reduction. With the freed memory the KV cache budget goes from ~53 GB to ~82 GB at the same 0.85 utilization, which is what lets the max context go from 32K to 128K, a clean 4× without changing anything else. Tool calling still works.

The variance is higher than BF16 (the runs spread from 37 to 49 tok/s) because MARLIN is software-emulated FP4 on this hardware, not native compute. The published benchmark target is 52 tok/s and I'm landing at 43.5; the gap is most likely torch.compile warmup and prefix cache state across cold runs, not the flags. Close enough. The hardware ceiling on this specific path is what it is until the silicon catches up or the backend changes.

	Before	After
Single-request tok/s	23.4	43.5
Weight memory	~52 GB	~16.5 GB
Max context	32,768	131,072

I leave the BF16 rollback block sitting in the launch script, commented. Pasting it into a shell reverts the config in about ninety seconds. The NVFP4 model and the patched file stay on disk; rollback is a container recreate, not a data restore.

Survivability is a feature

Sitting with the result, the morning's three lessons are the ones that compound. Fallback paths that aren't routinely exercised aren't really fallbacks: the missing tool-call-parser flag had been latent for months because nothing ever fell back. Baseline before change, every time, especially when the change feels obvious. And when you're about to mess with a running service, stage everything you can while it's still up; ninety seconds of downtime instead of fifteen minutes is the difference between a deploy and an incident.

There are open lines from here. Speculative decoding is on the table: the smaller Gemma 4 variants share the vocab with the 26B and are valid draft models, with a budget of two to four GB for another 1.5× on single-request latency that would compound with NVFP4. A second small-model container as a router (Qwen3-8B or similar) could move heartbeat and classification traffic off Gemma entirely; the freed GB handles it fine and the latency distribution wins more than further Gemma tuning would. And the upstream patch for the NVFP4 expert mapping is worth checking on periodically; when it lands in an image tag, the bind mount goes away.

For now the picture is clean. The local Gemma serves at 1.86× the throughput on a third of the weight memory, with a context window that can swallow whole documents instead of choking on them, and a fallback chain that means any provider going down (billing, rate limits, an API blip) routes traffic to the box on my desk instead of taking the assistant offline. The morning started with everything broken. The evening ends with a system that's harder to break than it was before any of this happened.