HeadlinesBriefing.com

Flash-MoE Runs 397B Model on Laptop

Hacker News

The Flash-MoE project demonstrates that a 397-billion-parameter Mixture-of-Experts model can run on a MacBook Pro with 48 GB of RAM. The pure C/Metal inference engine runs Qwen3.5-397B-A17B at more than 4.4 tokens per second with production-quality output, including tool calling, by streaming the entire 209 GB model from SSD through a custom Metal pipeline.
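
A quick sanity check on those figures (this arithmetic is ours, not the post's): 209 GB for 397 billion weights works out to

    209e9 bytes x 8 bits / 397e9 weights ≈ 4.2 bits per weight

which is consistent with roughly 4-bit quantization. And since the 209 GB file is more than four times the machine's 48 GB of memory, full in-memory residency is impossible; streaming from SSD is the only way the model can run at all.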

The model's hybrid architecture pairs 45 GatedDeltaNet layers with 15 standard full-attention layers, and the engine exploits it with techniques such as SSD Expert Streaming, which loads only the 4 experts active for each token (sketched below). An FMA-optimized dequantization kernel delivers 12% better performance by rearranging the arithmetic to make efficient use of the GPU's fused multiply-add units.
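
To make the expert-streaming idea concrete, here is a minimal C sketch of a per-token load path. The file layout, struct, and function names are our assumptions for illustration, not the project's actual code, which has its own on-disk format and Metal pipeline:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical on-disk layout: each layer's experts are stored
     * contiguously in the model file, all quantized to the same size. */
    typedef struct {
        off_t  base_offset;   /* file offset of this layer's expert table */
        size_t expert_bytes;  /* size of one quantized expert */
    } layer_experts_t;

    /* Read only the experts the router selected for this token
     * (4 of them, per the article) into a staging buffer. */
    static int load_active_experts(int fd, const layer_experts_t *layer,
                                   const int *active_ids, int n_active,
                                   uint8_t *staging)
    {
        for (int i = 0; i < n_active; i++) {
            off_t off = layer->base_offset
                      + (off_t)active_ids[i] * (off_t)layer->expert_bytes;
            ssize_t got = pread(fd, staging + (size_t)i * layer->expert_bytes,
                                layer->expert_bytes, off);
            if (got != (ssize_t)layer->expert_bytes)
                return -1;  /* short read or I/O error */
        }
        return 0;  /* staging would then be uploaded to a Metal buffer */
    }

The point is that each token touches only 4 experts' worth of bytes rather than the full expert table, which is what makes a 209 GB model tractable over an SSD link.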

The developers built this without Python or any framework, using only C, Objective-C, and hand-tuned Metal shaders. The project is a notable result for efficient large-model inference, showing how a specialized implementation can overcome hardware limitations through careful optimization of both the compute and I/O pathways.
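
On the compute side, the FMA rearrangement mentioned above can be illustrated with a small C sketch. The block format (32 4-bit weights sharing one scale and one zero point) and all names here are hypothetical, not the project's actual kernel; the point is only the rearrangement itself: w = (q - zero) * scale becomes w = scale * q + bias with bias = -scale * zero hoisted out of the inner loop, turning a subtract plus a multiply into a single fused multiply-add per weight:

    #include <math.h>
    #include <stdint.h>

    /* Hypothetical block format: 32 4-bit weights packed into 16 bytes,
     * sharing one scale and one zero point.
     * Naive dequant: w = (q - zero) * scale  -- subtract, then multiply.
     * Rearranged:    w = scale * q + bias, bias = -scale * zero hoisted
     * out of the loop, so each weight is one fused multiply-add. */
    static void dequant_block_fma(const uint8_t packed[16],
                                  float scale, float zero, float out[32])
    {
        const float bias = -scale * zero;  /* computed once per block */
        for (int i = 0; i < 16; i++) {
            uint8_t b = packed[i];
            out[2 * i]     = fmaf(scale, (float)(b & 0x0F), bias);
            out[2 * i + 1] = fmaf(scale, (float)(b >> 4),   bias);
        }
    }

The same reshaping carries over to a Metal shader, where the fma() builtin maps directly onto the GPU's fused multiply-add units credited for the 12% gain.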