HeadlinesBriefing.com

Gemma 4's New Speed Upgrade: MTP Cuts Latency by Half

Ars Technica

Google rolled out its Gemma 4 open models this spring, adding a new speed tier for local AI. The company introduced Multi‑Token Prediction, or MTP, a speculative decoding technique that drafts several future tokens at once and verifies them together, cutting generation time in half compared to standard one-token-at-a-time autoregressive decoding. The upgrade keeps output quality unchanged while lowering latency on consumer hardware.
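To make the idea concrete, here is a minimal sketch of the draft-and-verify loop behind speculative decoding. This is an illustration of the general technique, not Google's actual MTP implementation: `drafter` and `target` are hypothetical stand-in models (simple toy rules here), and a real system verifies all drafted positions in a single batched forward pass rather than one at a time.

```python
# Toy sketch of speculative decoding. The drafter proposes a run of tokens
# cheaply; the target model checks them and keeps the matching prefix.
# Both models here are hypothetical stand-ins that map a token sequence
# to the next token id.

def drafter(seq):
    # Cheap, approximate model (toy rule for illustration).
    return (seq[-1] + 1) % 10

def target(seq):
    # Expensive, authoritative model (same toy rule, so drafts agree).
    return (seq[-1] + 1) % 10

def speculative_step(seq, k=4):
    """Draft k tokens with the cheap model, then verify with the target.

    Returns the tokens accepted this step. In the best case all k drafts
    match and k tokens are emitted for the cost of one verification sweep,
    instead of k sequential target calls.
    """
    # Phase 1: draft k tokens autoregressively with the cheap model.
    drafts, ctx = [], list(seq)
    for _ in range(k):
        t = drafter(ctx)
        drafts.append(t)
        ctx.append(t)

    # Phase 2: verify against the target (serially here for clarity;
    # real implementations score all k positions in one parallel pass).
    accepted, ctx = [], list(seq)
    for d in drafts:
        t = target(ctx)
        accepted.append(t)       # always keep the target's token
        if t != d:
            break                # mismatch: discard the remaining drafts
        ctx.append(t)
    return accepted

print(speculative_step([0], k=4))  # all four drafts accepted: [1, 2, 3, 4]
```

Because the two toy models agree, every draft is accepted and four tokens are emitted per step; with real models, the speedup depends on how often the drafter's guesses match the target's output.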

Both Gemma 4 and Google’s flagship Gemini share the same core architecture, but Gemma is tuned for edge execution. One high‑power accelerator can run the 26‑billion‑parameter model at full precision; quantization lets it run on a consumer GPU like the NVIDIA RTX PRO 6000. MTP uses a lightweight 74‑million‑parameter drafter that shares the main model’s key‑value cache, skipping costly recomputation of attention state.
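Back-of-the-envelope arithmetic shows why quantization matters for the 26-billion-parameter figure. The sketch below counts weight storage only (not the KV cache or activations, so real requirements run higher), assuming 16 bits per parameter at "full" precision and 4 bits per parameter for a typical quantized copy; the exact bit widths are assumptions, not figures from the article.

```python
# Approximate weight-only memory for a 26B-parameter model at different
# precisions. 1 GB = 1e9 bytes; KV cache and activations are excluded.

PARAMS = 26e9  # parameter count from the article

def weights_gb(params, bits_per_param):
    """Weight storage in gigabytes for a given precision."""
    return params * bits_per_param / 8 / 1e9

full_precision = weights_gb(PARAMS, 16)  # e.g. bf16/fp16
quantized = weights_gb(PARAMS, 4)        # e.g. 4-bit integer quantization

print(f"16-bit weights: {full_precision:.0f} GB")  # 52 GB
print(f" 4-bit weights: {quantized:.0f} GB")       # 13 GB
```

The roughly 4x drop, from about 52 GB of weights to about 13 GB, is what moves the model from data-center accelerator territory into the memory budget of a single high-end consumer GPU.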

Google also switched Gemma 4’s license to the more permissive Apache 2.0, moving away from the custom terms used before. The change lets developers integrate and modify the model without restrictive clauses. For users, the combination of faster token prediction and an open license means higher performance on modest hardware and greater freedom to experiment, while hardware demands stay manageable.