HeadlinesBriefing.com

Google DeepMind Unveils Diffusion Gemma for 4x Faster Text Generation

Google DeepMind Blog

Google DeepMind has released Diffusion Gemma, an experimental open model that generates text up to 4x faster than traditional autoregressive language models. Its 26B Mixture-of-Experts architecture produces entire 256-token blocks at once rather than generating tokens one at a time. That speed lets researchers and developers explore interactive workflows that were previously constrained by inference latency.
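To make the block-wise idea concrete, the sketch below counts sequential model passes under each scheme. It is an illustration only: the function names are hypothetical, and the number of refinement passes per block (`denoise_steps`) is an assumed placeholder, since the article does not state how many passes Diffusion Gemma uses.

```python
import math

BLOCK_SIZE = 256  # tokens produced per block, as stated in the article


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one sequential forward pass per token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, denoise_steps: int) -> int:
    """Each block is refined over `denoise_steps` passes; blocks run in sequence.

    `denoise_steps` is an assumption for illustration, not a published figure.
    """
    num_blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return num_blocks * denoise_steps


if __name__ == "__main__":
    n = 1024
    steps = 64  # hypothetical value, chosen only to make the arithmetic visible
    print("autoregressive passes:", autoregressive_passes(n))
    print("block-diffusion passes:", block_diffusion_passes(n, steps))
```

Pass counts are not a latency measurement: each diffusion pass processes a whole block, so its cost differs from a single-token pass. The sketch only shows why fewer sequential steps can translate into faster generation on hardware that handles a full block in parallel.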

Built on Gemma 4 and on Gemini Diffusion research, the model uses a novel diffusion head to maximize generation speed. It reaches 1,000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an RTX 5090, and it fits within 18 GB of VRAM when quantized. Its bidirectional attention lets every token attend to all others, which enables non-linear text structures and in-line editing.
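The difference between causal and bidirectional attention can be shown with a toy mask. This is a minimal sketch in plain Python, not the model's actual implementation; the helper names are hypothetical and the block length of 4 stands in for the 256-token blocks described above.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Every position may attend to every position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 4  # toy block length for readability
    print(visible_positions(causal_mask(n), 1))         # [0, 1]
    print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3]
```

Under the causal mask, position 1 sees only itself and earlier positions. Under the bidirectional mask it also sees positions 2 and 3, so later text can inform earlier tokens, which is the property that makes in-line editing plausible.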

However, this speed comes with trade-offs. Diffusion Gemma's output quality lags behind that of standard Gemma 4 models, making it unsuitable for production applications that demand maximum quality. The model excels in local, low-concurrency inference scenarios where hardware utilization matters most. For cloud serving at high queries-per-second rates, traditional autoregressive models remain more efficient.

Developers can download the Apache 2.0-licensed weights from Hugging Face and integrate them using MLX, vLLM, or Transformers. Fine-tuning tutorials are available for Unsloth, NVIDIA NeMo, and Hackable Diffusion. The release reflects Google's push to explore alternative architectures beyond conventional LLM approaches.