HeadlinesBriefing.com

DeepSeek's $5.6M LLM Architecture Sparks Open-Source Race

ByteByteGo

In December 2024, DeepSeek released V3 and reported training the frontier-class model for just $5.576 million. Multi-Head Latent Attention slashed memory usage, the expert routing strategy kept load balanced across experts without the performance penalty that balancing typically carries, and aggressive FP8 training cut costs further. Within months, Moonshot AI's Kimi K2 team adopted DeepSeek's architecture as its starting point, scaled it to a trillion parameters, and invented a new optimizer to keep training stable at that scale.

Then in February 2026, Zhipu AI's GLM-5 integrated DeepSeek's sparse attention mechanism while contributing a novel reinforcement learning framework. This is how the open-weight ecosystem actually works: teams build on each other's innovations in public, and the pace of progress compounds.

Almost every major open-weight LLM released at the frontier in 2025 and 2026 uses a Mixture-of-Experts (MoE) transformer architecture. The reason is simple: a dense transformer activates all of its parameters for every token, so a dense model with hundreds of billions of parameters is prohibitively expensive to train and run.

MoE solves this by replacing each monolithic feed-forward layer with many smaller "expert" networks and a learned router that decides which experts handle each token. This lets a model hold 671 billion parameters in total, as DeepSeek V3 does, while computing with only 37 billion of them per token. Think of a specialist hospital with 384 doctors on staff, but only 8 in the room for any given patient. You benefit from the knowledge of 384 specialists while paying for only 8 at a time.
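The routing idea above can be sketched in a few lines of plain Python. This is a toy illustration, not any lab's actual implementation: the names `route_token`, `moe_layer`, and `make_expert` are hypothetical, the experts are simple scaling functions rather than neural networks, and the softmax gate is an assumption (production routers differ in gating function, shared experts, and load balancing). The 384 and 8 mirror the hospital analogy, and 671 and 37 are the parameter counts quoted above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts and renormalise their gate weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda vec: [scale * v for v in vec]


def moe_layer(token, experts, router_weights, top_k):
    """Run one token through only its top_k experts and mix their outputs."""
    # Router logit for each expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = route_token(logits, top_k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # only the selected experts are evaluated
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [index for index, _ in routes]


random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 384, 8, 4  # 384 "doctors", 8 "in the room"
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(random.uniform(0.5, 1.5)) for _ in range(NUM_EXPERTS)]

token = [0.1, -0.3, 0.7, 0.2]
output, used = moe_layer(token, experts, router_weights, TOP_K)

# Share of parameters doing work per token, using the figures quoted above.
active_fraction = 37 / 671
print(f"experts used: {len(used)} of {NUM_EXPERTS}")
print(f"active parameter share: {active_fraction:.1%}")
```

The other 376 experts are never called for this token, which is where the savings come from: stored capacity grows with the number of experts, while per-token compute grows only with `top_k`.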