HeadlinesBriefing.com

Mercury 2: Diffusion LLM Delivers 1,000+ Tokens/Second

Hacker News

Inception has unveiled Mercury 2, which it bills as the world's fastest reasoning language model. Built on a diffusion architecture rather than traditional autoregressive decoding, Mercury 2 achieves over 5x faster generation by producing multiple tokens simultaneously and refining them in parallel.
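The core difference can be illustrated with a toy sketch (this is an assumption-laden simplification, not Inception's actual method): an autoregressive decoder makes one model call per output token, while a diffusion-style decoder starts from a fully masked sequence and refines every position in parallel over a small, fixed number of steps.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def autoregressive_decode(length, counter):
    """Emit one token per model call: `length` tokens cost `length` calls."""
    out = []
    for _ in range(length):
        counter[0] += 1                # one model call per token
        out.append(random.choice(VOCAB))
    return out

def diffusion_decode(length, refine_steps, counter):
    """Start fully masked; each call updates ALL positions in parallel."""
    out = [MASK] * length
    for _ in range(refine_steps):
        counter[0] += 1                # one model call refines the whole sequence
        out = [random.choice(VOCAB) for _ in out]
    return out

ar_calls, df_calls = [0], [0]
autoregressive_decode(32, ar_calls)
diffusion_decode(32, refine_steps=4, counter=df_calls)
print(ar_calls[0], df_calls[0])  # 32 model calls vs 4
```

The toy "model" here is just random choice; the point is the call count: sequential decoding scales with output length, while parallel refinement scales with the (much smaller) number of refinement steps.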

This architectural shift fundamentally changes the speed-quality equation for production AI. Current models force teams to choose between reasoning-grade quality and real-time latency, but Mercury 2 delivers both. The model generates 1,009 tokens per second on NVIDIA Blackwell GPUs while maintaining quality competitive with speed-optimized models. Pricing sits at $0.25 per million input tokens and $0.75 per million output tokens.
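From the published per-million-token rates, a per-request cost is straightforward to estimate; the example token counts below are illustrative, not from the announcement:

```python
# Published Mercury 2 pricing (USD per million tokens).
INPUT_PER_M = 0.25
OUTPUT_PER_M = 0.75

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request at the published rates."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical request: 2,000 input tokens, 500 output tokens.
cost = request_cost(2_000, 500)
print(f"${cost:.6f}")  # $0.000875
```

At these rates, even a high-volume workload of a million such requests would cost under $1,000.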

Mercury 2 targets latency-sensitive applications where traditional LLMs create bottlenecks. For coding workflows, it provides autocomplete and editing suggestions fast enough to feel natural. In agentic loops, reduced per-call latency allows more inference steps without degrading responsiveness. Voice interfaces gain reasoning-level quality within natural speech cadences, while search and RAG pipelines can add sophisticated reasoning without exceeding latency budgets.