
Editorial polish — DiffusionGemma
{"title": "Google's DiffusionGemma hits 1,000 tokens/sec by generating 256-token blocks in parallel", "summary": "Google DeepMind's DiffusionGemma, a 26B MoE model, generates 256-token blocks simultaneously rather than sequentially, reaching 1,000+ tokens/sec on an H100 — a 4× throughput lift Google flags as experimental with output quality below standard Gemma 4. Licensed Apache 2.0 and fitting in 18GB VRAM quantized, it ships today with vLLM, Hugging Face, and MLX support, giving developers a concrete path to build low-latency inline editors and code-infill tools on consumer GPUs."}
Source: blog.google ↗
Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously.
Google DeepMind
Why this matters
- → Parallel decoding eliminates GPU underutilization bottleneck in local inference.
- → Enables real-time interactive tools (code infill, inline editing) on consumer hardware.
- → Shifts decode constraint from memory to compute, unlocking new use cases.
Parallel text generation
Also in this edition