HeadlinesBriefing.com

Google adds Multi‑Token Speed Boost to Gemini Nano on Pixel

Google AI Blog

Google has retrofitted its frozen Gemini Nano v3 models with a Multi-Token Prediction (MTP) head, making on-device language generation on Pixel smartphones up to twice as fast. The new head lets the model draft several tokens in a single pass, cutting latency and power use for features like notification summaries and on-the-fly proofreading.

The MTP head attaches to the final transformer layers of the frozen backbone and reuses the model's high-dimensional activations instead of running a separate draft model. This zero-copy design shares the main KV cache, saving roughly 130 MB of RAM per instance and eliminating prefill delays. Benchmarks on Pixel 9 show speed gains of 50% or more versus standalone drafters in real-world apps.
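To make the shared-activation idea concrete, here is a minimal Python sketch. It is not Google's implementation. It assumes one small projection per future position, each reading the hidden state the backbone has already computed, so the drafting step needs no second model and no cache of its own. `ToyMTPHead`, `frozen_backbone_hidden`, the random weights, and all dimensions are hypothetical and chosen only for illustration.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def frozen_backbone_hidden(tokens: List[int], hidden_dim: int) -> Vector:
    """Hypothetical stand-in for the backbone's final-layer activation.

    A real backbone would run its transformer layers here; this toy just
    derives a deterministic vector from the token sequence.
    """
    rng = random.Random(sum(tokens) * 1000 + len(tokens))
    return [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]


class ToyMTPHead:
    """Assumed design: one small projection per drafted position."""

    def __init__(self, hidden_dim: int, vocab_size: int,
                 num_draft_tokens: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Random weights stand in for trained head parameters.
        self.projections: List[Matrix] = [
            [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
             for _ in range(vocab_size)]
            for _ in range(num_draft_tokens)
        ]

    def draft(self, hidden: Vector) -> List[int]:
        """Propose one token per future position from a single hidden state."""
        drafts = []
        for projection in self.projections:
            logits = matvec(projection, hidden)
            drafts.append(max(range(len(logits)), key=logits.__getitem__))
        return drafts


# The head consumes the activation the backbone already produced.
hidden = frozen_backbone_hidden([5, 17, 42], hidden_dim=8)
head = ToyMTPHead(hidden_dim=8, vocab_size=32)
print(head.draft(hidden))
```

In this toy, the only state the head owns is its projection weights; everything else is borrowed from the backbone's forward pass. That is the property the article credits with the memory savings.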

Because every draft is verified against the frozen Gemini Nano backbone, the final output remains bit-for-bit identical to what the backbone alone would produce. This preserves safety alignment while delivering up to two extra tokens per inference pass. Developers gain a drop-in speed boost without retraining task-specific models, and end users get faster AI-assisted text without extra battery drain on current Pixel devices.
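The identical-output guarantee follows from the draft-and-verify loop itself, which a toy greedy decoder can illustrate. The sketch below is an assumption-laden illustration, not the shipped algorithm: `target_next` stands in for the frozen backbone, `draft_next` for a drafter that is deliberately wrong on some steps, and both are hypothetical. In a real system the verification loop would be a single batched forward pass; here it is written as a plain loop for readability.

```python
from typing import List, Tuple

VOCAB = 50


def target_next(context: List[int]) -> int:
    """Toy frozen backbone: deterministic greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_next(context: List[int]) -> int:
    """Toy drafter: matches the backbone except when len(context) % 4 == 0."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % VOCAB


def generate_baseline(prompt: List[int], n: int) -> List[int]:
    """One backbone pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def generate_speculative(prompt: List[int], n: int,
                         k: int = 2) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them in one (simulated) backbone pass."""
    out = list(prompt)
    goal = len(prompt) + n
    passes = 0
    while len(out) < goal:
        # Drafting step: propose k tokens cheaply.
        ctx = list(out)
        drafts = []
        for _ in range(k):
            token = draft_next(ctx)
            drafts.append(token)
            ctx.append(token)
        # Verification step: keep drafts only while they match the backbone.
        passes += 1
        ctx = list(out)
        for token in drafts:
            expected = target_next(ctx)
            if token != expected:
                ctx.append(expected)  # backbone's own token replaces the miss
                break
            ctx.append(token)
        else:
            ctx.append(target_next(ctx))  # all drafts accepted: bonus token
        out = ctx[:goal]
    return out[len(prompt):], passes


baseline = generate_baseline([1, 2, 3], 12)
speculative, passes = generate_speculative([1, 2, 3], 12)
print(speculative == baseline, passes)
```

Every emitted token is either a draft the backbone agreed with or the backbone's own choice, so the sequence cannot differ from the baseline. With `k = 2`, a pass yields at most three tokens, which is one reading of the article's "up to two extra tokens per inference pass".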