HeadlinesBriefing.com

Gemma 4 QAT: Mobile‑Ready Models Cut Memory to 1 GB

Hacker News

Google’s Gemini team has released new Gemma 4 checkpoints that combine Quantization‑Aware Training (QAT) with a mobile‑specific quantization schema, cutting memory requirements on edge devices. The update follows earlier Multi‑Token Prediction changes and a 12‑billion‑parameter model that filled the gap between the 4B and 26B sizes. The result is a set of models that can be deployed locally without losing textual fluency.

QAT simulates quantization during training, so the weights adapt to low precision and retain quality that standard post‑training quantization loses. The release ships Q4_0 checkpoints alongside a new format aimed at phones; the E2B variant now fits in under 1 GB of memory. Developers can drop unused audio or vision encoders to cut memory further, leaving room for longer chats on a single device.
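To make "simulated quantization" concrete, here is a minimal sketch of the quantize-then-dequantize step that QAT typically inserts into the forward pass. It is an illustration under stated assumptions, not Google's training recipe: the function names are hypothetical, the scheme is a simplified symmetric 4‑bit one, and only the block size of 32 is borrowed from llama.cpp's Q4_0, whose exact bit layout is not reproduced here.

```python
from typing import List

# Weights are quantized in independent blocks, each with its own scale.
# llama.cpp's Q4_0 also uses blocks of 32; the rest is a simplification.
BLOCK_SIZE = 32


def fake_quantize_block(block: List[float], bits: int = 4) -> List[float]:
    """Round one block to signed `bits`-bit integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / qmax       # one scale shared by the whole block
    out = []
    for w in block:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                       # dequantized value
    return out


def fake_quantize(weights: List[float], bits: int = 4) -> List[float]:
    """Apply block-wise fake quantization to a flat list of weights."""
    out: List[float] = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE], bits))
    return out
```

In a typical QAT setup the forward pass would use `fake_quantize(weights)` while the optimizer keeps updating the full-precision weights, so the model learns values that survive rounding. Each block can take at most 16 distinct values, and the rounding error per weight is at most half of that block's scale.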

The mobile schema pre‑computes static activations, aligns channel‑wise quantization with phone accelerators, and applies 2‑bit compression only to token‑generation layers while keeping reasoning layers at high precision. Embedding and KV cache optimizations shrink the active footprint further, letting users run a full‑size Gemma 4 model without relying on the cloud.
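Channel‑wise quantization is easier to see next to its alternative. The sketch below compares a single scale for a whole weight matrix with one scale per output channel (here, per row). It is a generic illustration with hypothetical function names and a simplified symmetric 4‑bit scheme; the release's actual mobile format and its accelerator alignment are not described in enough detail to reproduce.

```python
from typing import List

Matrix = List[List[float]]  # each row is treated as one output channel


def quantize_dequantize(values: List[float], scale: float, bits: int = 4) -> List[float]:
    """Round values to signed `bits`-bit integers with the given scale, then map back."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(weights: Matrix, bits: int = 4) -> Matrix:
    """One scale for the entire matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for row in weights for v in row) / qmax) or 1.0
    return [quantize_dequantize(row, scale, bits) for row in weights]


def per_channel(weights: Matrix, bits: int = 4) -> Matrix:
    """A separate scale for every row (channel)."""
    qmax = 2 ** (bits - 1) - 1
    out: Matrix = []
    for row in weights:
        scale = (max(abs(v) for v in row) / qmax) or 1.0
        out.append(quantize_dequantize(row, scale, bits))
    return out


def mean_abs_error(a: Matrix, b: Matrix) -> float:
    """Average absolute difference between two matrices of the same shape."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

With a matrix such as `[[4.0, -2.0, 1.0], [0.04, -0.02, 0.01]]`, the per‑tensor scale is dominated by the first row, so every value in the second row rounds to zero. Per‑channel scaling gives the small row its own scale and keeps its values distinguishable, which is why it usually loses less accuracy at the same bit width.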

Google distributes the new weights on Hugging Face as GGUF files for llama.cpp and in compressed‑tensors format for vLLM, while LiteRT‑LM and Transformers.js handle on‑device inference. With these tools, developers can iterate locally, fine‑tune with Hugging Face Transformers, and deploy across Apple Silicon and standard GPUs. The Gemma 4 QAT set delivers edge‑ready performance today.