HeadlinesBriefing.com

DeepSeek V4 Gets Metal-Optimized Local Engine

Hacker News

DeepSeek V4 Flash now has a specialized local inference engine called ds4, built exclusively for Apple systems running Metal. This narrow-purpose implementation is not a generic GGUF runner but a custom graph executor optimized for the model's architecture. The project focuses on making the 284-billion-parameter model practical on personal machines, with special attention to a compressed KV cache that enables long-context inference on local hardware.

The engine leverages DeepSeek V4 Flash's particular advantages: faster processing due to fewer active parameters, thinking sections up to 5 times shorter than those of competing models, and compatibility with 2-bit quantization. Performance tests show a MacBook Pro M3 Max handling 250.11 tokens/second during prefill with long contexts. The Metal path achieves significantly better throughput than the CPU path, which remains unstable on current macOS versions because of virtual memory bugs.
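To put the 2-bit quantization and prefill figures in perspective, here is a rough back-of-the-envelope sketch. It reads the model size as 284 billion total parameters (an assumption), ignores quantization scales, metadata, and any tensors kept at higher precision, and treats the quoted prefill rate as constant. The function names are illustrative and are not part of ds4.

```python
# Back-of-the-envelope estimates. All inputs are assumptions based on the
# figures quoted in the article, not measurements of ds4 itself.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in bytes, ignoring quantization scales and metadata."""
    return n_params * bits_per_weight / 8


def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to ingest a prompt at a constant prefill rate."""
    return n_tokens / tokens_per_second


TOTAL_PARAMS = 284e9   # assumed: 284 billion total parameters
PREFILL_RATE = 250.11  # tokens/second, the M3 Max figure quoted in the article

two_bit_gb = weight_bytes(TOTAL_PARAMS, 2) / 1e9
fp16_gb = weight_bytes(TOTAL_PARAMS, 16) / 1e9
long_prompt_s = prefill_seconds(100_000, PREFILL_RATE)  # hypothetical prompt size

print(f"2-bit weights:      ~{two_bit_gb:.0f} GB")
print(f"16-bit weights:     ~{fp16_gb:.0f} GB")
print(f"100k-token prefill: ~{long_prompt_s:.0f} s")
```

Under those assumptions the raw weights come to roughly 71 GB at 2 bits versus 568 GB at 16 bits, which is why aggressive quantization matters for fitting the model into the unified memory of a high-end Mac.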

ds4.c represents a deliberate bet on single-model optimization rather than broad framework support. It only works with DeepSeek V4 Flash GGUF files and requires specific quantization approaches. The project emerged from recognizing that compressed KV caches should be treated as disk citizens, not just RAM components. While currently alpha quality, it aims to provide a finished end-to-end experience for local AI inference on Apple's high-end personal computing hardware.
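The idea of a KV cache that lives on disk can be illustrated with a minimal sketch. This is not ds4's actual file format or API: the `.kv` extension, the hashing scheme, and the function names are all hypothetical, and the cache contents are treated as an opaque byte blob. The point is only that a cache keyed by its token prefix can be written once and reloaded later instead of repeating prefill.

```python
import hashlib
import os
import tempfile
from typing import List, Optional


def cache_path(cache_dir: str, token_ids: List[int]) -> str:
    """Derive a file name from the token prefix the cache corresponds to."""
    key = ",".join(str(t) for t in token_ids).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return os.path.join(cache_dir, digest + ".kv")  # hypothetical extension


def save_kv(cache_dir: str, token_ids: List[int], kv_blob: bytes) -> str:
    """Write the cache to a temp file, then rename it into place atomically."""
    path = cache_path(cache_dir, token_ids)
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(kv_blob)
    os.replace(tmp_path, path)  # readers never observe a half-written file
    return path


def load_kv(cache_dir: str, token_ids: List[int]) -> Optional[bytes]:
    """Return the stored cache for this exact prefix, or None on a miss."""
    path = cache_path(cache_dir, token_ids)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return f.read()


# Usage: a miss on first lookup, then a hit after saving.
with tempfile.TemporaryDirectory() as cache_dir:
    prompt = [101, 7592, 2088]   # made-up token ids
    fake_kv = bytes(range(16))   # stand-in for real cache contents
    miss = load_kv(cache_dir, prompt)
    save_kv(cache_dir, prompt, fake_kv)
    restored = load_kv(cache_dir, prompt)

print("first lookup:", miss)
print("restored matches:", restored == fake_kv)
```

A real engine would more likely memory-map the file and match the longest shared prefix rather than the exact token sequence; the sketch keeps exact matching for brevity.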