HeadlinesBriefing.com

Gemma 4 12B Brings Multimodal AI to Laptop Hardware

Google DeepMind Blog

Google DeepMind has released Gemma 4 12B, a 12‑billion‑parameter multimodal model designed to run on a standard laptop. The design eliminates separate vision and audio encoders, so raw pixel and waveform streams enter the LLM core directly. The model runs in 16 GB of VRAM, delivering local agentic reasoning without cloud round‑trips.
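To put the 16 GB figure in context, the sketch below estimates how much memory the weights of a 12‑billion‑parameter model occupy at a few common numeric precisions. It is back‑of‑the‑envelope arithmetic only: the precision names are generic, the announcement does not say which precision the 16 GB figure assumes, and activations, the KV cache and runtime overhead are ignored.

```python
# Rough weights-only memory estimate for a 12-billion-parameter model.
# Assumption: every parameter is stored at the same bit width; activations,
# KV cache and framework overhead are NOT included.

PARAMS = 12_000_000_000          # 12B parameters, as stated in the article
VRAM_BUDGET_GB = 16              # VRAM figure quoted in the article

# Generic precisions (illustrative; not taken from the announcement).
BITS_PER_PARAM = {"fp32": 32, "bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Precisions whose weights alone fit inside the quoted VRAM budget.
fits = [
    name
    for name, bits in BITS_PER_PARAM.items()
    if weight_memory_gb(PARAMS, bits) <= VRAM_BUDGET_GB
]

for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
print("fits in budget (weights only):", fits)
```

Under these assumptions, the weights alone come to 24 GB at 16 bits per parameter and 12 GB at 8 bits, so a 16 GB budget would likely imply some form of reduced‑precision weights.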

Benchmark tests place Gemma 4 12B within striking distance of DeepMind’s larger 26B Mixture‑of‑Experts system while requiring less than half the memory. Community adoption tops 150 million downloads, fueling projects from wearable robotic arms to enterprise AI security. The model ships under an Apache 2.0 license, encouraging broad integration.

The unified architecture replaces the vision encoder with a single matrix‑multiply embedding layer and discards the audio encoder entirely, projecting raw audio into the token space. Developers can access the model via LM Studio, Ollama, Hugging Face Transformers, llama.cpp, SGLang and vLLM, and fine‑tune efficiently with Unsloth.
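The encoder‑free idea can be illustrated with a toy sketch: image patches and audio frames are flattened into vectors and mapped into a shared embedding space by one matrix multiply each, with no separate encoder network in between. Everything here is hypothetical and chosen for readability: the function names, the tiny 4×4 image, the frame length, the embedding width and the weight values are all illustrative assumptions, not Gemma's actual layers or dimensions.

```python
# Toy illustration of encoder-free multimodal input: raw pixels and raw audio
# samples are each turned into token embeddings by a single linear projection.
# All names, sizes and weights below are hypothetical.


def split_into_patches(image, patch):
    """Cut an H x W grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(
                [image[top + dy][left + dx]
                 for dy in range(patch) for dx in range(patch)]
            )
    return patches


def frame_audio(samples, frame):
    """Cut a 1-D waveform into non-overlapping frames of `frame` samples."""
    return [samples[i:i + frame]
            for i in range(0, len(samples) - frame + 1, frame)]


def project(vectors, weights):
    """Single matrix multiply: (n x d_in) times (d_in x d_model) -> (n x d_model)."""
    d_model = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec)))
         for j in range(d_model)]
        for vec in vectors
    ]


D_MODEL = 3  # illustrative embedding width

# 4 x 4 image with pixel values 0..15, split into 2 x 2 patches (4 values each).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
# Illustrative weights: copy the first D_MODEL inputs straight through.
weights = [[1 if i == j else 0 for j in range(D_MODEL)] for i in range(4)]

image_tokens = project(split_into_patches(image, patch=2), weights)
audio_tokens = project(frame_audio([1, 2, 3, 4, 5, 6, 7, 8], frame=4), weights)

# Both modalities now live in the same embedding space and can be concatenated
# into one sequence for the language-model core.
sequence = image_tokens + audio_tokens
print(len(sequence), "tokens of width", len(sequence[0]))
```

In a real model the projection weights would be learned and the dimensions far larger; the point of the sketch is only that each modality needs one projection step rather than a dedicated encoder.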

By delivering performance close to that of the larger MoE model on everyday hardware, Gemma 4 12B lowers the barrier for privacy‑focused, on‑device AI that processes vision and sound together. This enables new agentic workflows that do not depend on costly cloud services, making sophisticated multimodal reasoning practical for developers and end users alike.