HeadlinesBriefing favicon HeadlinesBriefing.com

Building Fast Local Coding Agent with Gemma 4 on macOS

Hacker News •
×

After internet outages stranded the author without cloud-based coding assistants, they built a local coding agent using Gemma 4 26B-A4B with llama.cpp on macOS. The setup combines Metal acceleration, Multi-Token Prediction draft models, and multimodal support to create a self-hosted AI coding tool that works through OpenAI-compatible APIs.

Testing revealed significant performance gains with MTP speculative decoding. The base model achieved 58.2 tokens/second, while the MTP-enhanced setup reached 72.2 tokens/second—a 24% improvement. Prompt processing remained consistent at around 295 tokens/second across configurations. The author tested various --spec-draft-n-max values and found 3 draft tokens optimal on their M1 Max hardware.

Surprisingly, llama.cpp outperformed MLX for this workload despite MLX being Apple-optimized. The llama.cpp Metal implementation with MTP draft achieved 72.2 tokens/second, beating MLX's 45.8 tokens/second. Adding the multimodal projector enabled screenshot support without performance penalty, maintaining the 72.2 tokens/second generation speed.

The complete setup requires approximately 17GB of storage and runs on consumer hardware. This demonstrates that high-quality local AI coding agents are now viable for developers who need reliable, offline-capable tools. The open-source stack provides a practical alternative to cloud services that can fail when connectivity drops.