50+ tokens per second on a desktop: running LLMs on the NVIDIA DGX Spark

March 10, 2026 · 5 min read

TL;DR: We got a 30-billion-parameter LLM running at 51-54 tokens/sec on the NVIDIA DGX Spark by combining Mixture-of-Experts architecture, FP8 quantization, and a community Docker image that fixes Blackwell-specific issues. Here's what we learned.


The NVIDIA DGX Spark is an interesting machine. It packs a Blackwell GB10 GPU with 128GB of unified LPDDR5X memory into a desktop form factor. For XRPL Commons, we wanted local LLM inference for our development workflow, fast enough to be usable, private enough to run on-premises, and simple enough to replicate across machines.

Getting there was not straightforward. This post documents the journey from 3.7 tok/s (unusable) to 54 tok/s (excellent), and the key technical decisions that made the difference.

The Hardware

The DGX Spark ships with:

  • NVIDIA GB10 Blackwell GPU (SM 12.1)
  • 128GB unified LPDDR5X at 273 GB/s bandwidth
  • ARM Grace CPU (aarch64), 10 cores
  • 3.7TB NVMe storage
  • DGX OS (Ubuntu 24.04)

128GB of unified memory means you can fit very large models. But there's a catch.

The Bandwidth Wall

LLM inference is memory-bandwidth-bound. During autoregressive decoding, each token requires reading every active weight from memory once. At 273 GB/s, the math is simple:

  • Dense 32B model (bf16): 64GB of weights / 273 GB/s = ~234ms per token = ~4 tok/s
  • Dense 8B model (bf16): 16GB / 273 GB/s = ~59ms = ~17 tok/s

No amount of compute optimization changes this. The Spark can hold a 70B model in FP8, but it will generate tokens at walking pace. The memory is large but not fast.

We learned this the hard way. Our first attempt, Qwen3-32B at bf16, produced 3.7 tokens per second. Qwen3-8B was better at 13.1 tok/s, but still below the threshold for interactive use.

The MoE Breakthrough

The solution is Mixture-of-Experts (MoE) models. An MoE model has many total parameters but only activates a fraction per token. Qwen3-30B-A3B has 30 billion parameters but only 3 billion active ones, the router activates a small subset of experts per token, leaving the rest idle in memory.

The bandwidth math changes completely:

  • MoE 30B, 3B active (bf16): ~6GB active weights / 273 GB/s = ~22ms = ~45 tok/s theoretical
  • MoE 30B, 3B active (FP8): ~3GB active weights / 273 GB/s = ~11ms = ~90 tok/s theoretical

You get the quality of a 30B model at the speed of a 3B model.

The Software Stack Problem

The DGX Spark's Blackwell GPU (SM 12.1) is new enough that upstream tooling doesn't fully support it:

  • Flash Attention 2 crashes with a PTX toolchain error
  • vLLM's MOE CUTLASS kernels don't include SM 12.1 in their architecture intersection lists
  • PyTorch officially supports up to SM 12.0
  • CUDA graphs, critical for throughput, simply don't work with a standard vLLM build

We spent considerable time on a manual vLLM build from source: patching CMakeLists.txt, building Triton from a specific commit, working around setuptools license field validation bugs, pinning transformers below 5.0 to avoid tokenizer breakage. The manual build worked but required --enforce-eager mode (no CUDA graphs), capping throughput at ~30 tok/s.

The Avarok Docker Image

The Avarok dgx-vllm project solves all of this in a single Docker image. It includes:

  • A patched vLLM v0.16.0rc2 with SM 12.1 support
  • Software E2M1 conversion for the missing NVFP4 PTX instruction
  • Custom CUTLASS kernels for the GB10
  • Working CUDA graphs and Flash Attention

One docker pull and one docker run command replaces hours of manual compilation.

Results

ModelQuantizationEngineTokens/sec
Qwen3-32B (dense)bf16Manual vLLM3.7
Qwen3-8B (dense)bf16Manual vLLM13.1
Qwen3-30B-A3B (MoE)bf16Manual vLLM28.6
Qwen3-30B-A3B (MoE)bf16Avarok Docker30.3
Qwen3-30B-A3B (MoE)FP8Avarok Docker51-54

The winning combination: MoE architecture + FP8 quantization + Avarok Docker with CUDA graphs.

We've deployed this setup across two DGX Sparks with consistent results. The FP8 model uses ~110GB of the 119GB available memory, leaving minimal headroom, but the throughput is worth it.

The Setup

The final deployment is remarkably simple:

docker pull avarok/dgx-vllm-nvfp4-kernel:v22

docker run -d \
  --name vllm \
  --gpus all \
  --shm-size=16g \
  --restart unless-stopped \
  -p 8000:8888 \
  -v /home/$USER/.cache/huggingface:/root/.cache/huggingface \
  -e MODEL=Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  -e PORT=8888 \
  -e GPU_MEMORY_UTIL=0.85 \
  -e MAX_MODEL_LEN=32768 \
  avarok/dgx-vllm-nvfp4-kernel:v22 serve

First boot takes 10-20 minutes (model download + CUDA graph capture). After that, it auto-starts on reboot and serves an OpenAI-compatible API at http://localhost:8000/v1.

Lessons Learned

1. Understand your bottleneck. The Spark's 273 GB/s bandwidth determines everything. Once we understood this, the model selection became obvious, MoE with minimal active parameters.

2. Don't build from source if you don't have to. Our manual vLLM build took hours of debugging across multiple sessions. The Avarok Docker image does everything better and in one command.

3. FP8 quantization is nearly free. The jump from bf16 to FP8 nearly doubled throughput (30.3 to 51 tok/s on the same engine) with no perceptible quality difference for our use cases.

4. Stop Ollama first. On one Spark, Ollama was consuming ~100GB of memory when we tried to install vLLM. The build process OOM-killed the machine. Disable competing inference servers before starting.

5. Kernel updates break NVIDIA drivers. DGX OS auto-updates the kernel, but the NVIDIA modules don't follow automatically. After a reboot, nvidia-smi may fail. The fix is sudo apt install linux-modules-nvidia-580-open-$(uname -r).

6. Community Docker images can be ahead of official ones. The Avarok image runs vLLM v0.16.0rc2 with Blackwell fixes, months ahead of where NVIDIA's own builds are.

What's Next

Community results suggest AWQ 4-bit quantization can push the same model to 82 tok/s. NVIDIA's own NVFP4-quantized models (like Qwen3-Next-80B-A3B) report even better quality at ~67 tok/s average. As toolchain support matures, these numbers should keep improving.

For now, 51-54 tok/s with a 30B-parameter MoE model is fast enough for interactive coding assistance, document drafting, and general-purpose use, all running locally on a desktop machine.

Resources

Try It Yourself

If you have a DGX Spark, the Docker approach takes about 20 minutes from zero to serving. Pull the Avarok image, run the container with the command above, and you're up. Reach out to us at XRPL Commons if you want the full setup guide with troubleshooting details.