HeadlinesBriefing.com

Ollama Leverages Apple MLX for Major Speed Boost on Silicon

Hacker News
Ollama just dropped a preview release, version 0.19, that switches its Apple Silicon backend to Apple’s MLX framework. The move capitalizes on MLX’s design around Apple Silicon’s unified memory, yielding substantial speed improvements across the entire M-series chip lineup. Developers running demanding personal assistants or coding agents like Claude Code should see immediate responsiveness gains.

Benchmarks on the M5 series illustrate the jump most clearly: the chips’ new GPU Neural Accelerators improve both time-to-first-token and raw generation speed. Published metrics show significant gains over the previous 0.18 implementation when running quantized models such as Alibaba’s Qwen3.5-35B-A3B. The integration aims to bring local LLM inference closer to production standards.
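Throughput figures like these can be reproduced locally: Ollama’s /api/generate endpoint reports timing fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds), from which prefill time and generation speed follow. A minimal sketch, operating on a response dict of that shape with made-up numbers rather than a live server:

```python
def generation_stats(resp: dict) -> dict:
    """Derive throughput numbers from an Ollama /api/generate response.

    Durations in the response are reported in nanoseconds.
    """
    ns = 1e9
    prefill_s = resp["prompt_eval_duration"] / ns  # approximates time-to-first-token
    gen_s = resp["eval_duration"] / ns
    return {
        "prefill_seconds": prefill_s,
        "prompt_tokens_per_s": resp["prompt_eval_count"] / prefill_s,
        "gen_tokens_per_s": resp["eval_count"] / gen_s,
    }

# Illustrative numbers only; a real response comes from POST /api/generate.
sample = {
    "prompt_eval_count": 64,
    "prompt_eval_duration": 200_000_000,  # 0.2 s of prefill
    "eval_count": 120,
    "eval_duration": 2_000_000_000,       # 2 s to generate 120 tokens
}
stats = generation_stats(sample)
print(stats["gen_tokens_per_s"])  # 60.0
```

Comparing these numbers for the same model and prompt across 0.18 and the 0.19 preview is the most direct way to see what the MLX backend buys on a given Mac.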

Another key technical addition involves support for NVIDIA’s NVFP4 quantization format, allowing users to maintain higher accuracy while reducing memory overhead. Furthermore, caching mechanisms received an overhaul, reusing conversation caches across branches to lower memory use and improve efficiency for agentic workflows. Ollama 0.19 is now available for download.
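The cache-reuse idea is essentially longest-common-prefix matching over token sequences: two conversation branches that share a prefix can share the KV cache computed for that prefix, so only the divergent tail is recomputed. A simplified sketch of the bookkeeping, with hypothetical names (this is not Ollama’s actual implementation):

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy cache mapping previously seen token sequences to reusable state."""

    def __init__(self) -> None:
        self.entries: list[list[int]] = []  # cached token sequences

    def add(self, tokens: list[int]) -> None:
        self.entries.append(list(tokens))

    def reusable_tokens(self, tokens: list[int]) -> int:
        """How many leading tokens of `tokens` need no recomputation."""
        return max((shared_prefix_len(e, tokens) for e in self.entries), default=0)

# Two branches of one conversation share their first four tokens,
# so the second branch only recomputes from token 4 onward.
cache = PrefixCache()
cache.add([1, 2, 3, 4, 10, 11])                      # branch A
reuse = cache.reusable_tokens([1, 2, 3, 4, 20, 21])  # branch B
print(reuse)  # 4
```

For agentic workflows that repeatedly fork a long shared context, this kind of reuse cuts both memory pressure and redundant prefill work, which matches the efficiency gains the release notes describe.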

This optimization drive demonstrates a clear commitment to maximizing performance on local hardware, particularly for specialized tasks like code generation where latency matters deeply. The focus remains on delivering top-tier local LLM experiences, especially for users with high-spec Macs exceeding 32GB unified memory.