HeadlinesBriefing.com

EAGLE 3.1 Boosts Speculative Decoding Performance

Hacker News

EAGLE 3.1, a new version of the EAGLE speculative decoding algorithm, was released as a collaboration between the EAGLE team, vLLM, and TorchSpec. It addresses the performance degradation that occurred under different chat templates, long-context inputs, and out-of-distribution system prompts, which makes it more practical for real-world deployment.

The EAGLE team identified "attention drift" as the core problem: as speculation goes deeper, the drafter's attention gradually shifts from sink tokens to its own generated tokens. EAGLE 3.1 introduces FC normalization after each target hidden state and feeds post-norm hidden states into subsequent decoding steps, so that deeper speculation behaves more like recursively invoking the drafter than like simply appending layers.
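
The effect of feeding post-norm hidden states back into the drafter can be illustrated with a toy rollout. This is a minimal sketch and not the EAGLE 3.1 architecture: it assumes an RMSNorm-style normalization with no learned gain, and `rms_norm`, `toy_draft_step`, and `rollout` are hypothetical names chosen for illustration. The toy has no attention, so it does not reproduce attention drift itself. It only shows that without normalization each deeper step sees an input on a different scale, while with post-norm feedback every step sees the same scale, as it would if the drafter were invoked afresh.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale x to (approximately) unit root-mean-square; no learned gain in this toy."""
    return x / np.sqrt(np.mean(x * x) + eps)

def toy_draft_step(h, W):
    """Hypothetical one-layer drafter step: a residual update of the hidden state."""
    return h + W @ h

def rollout(h0, W, depth, post_norm):
    """Run `depth` speculation steps and record the RMS of each step's input."""
    h = rms_norm(h0) if post_norm else h0
    input_rms = []
    for _ in range(depth):
        input_rms.append(float(np.sqrt(np.mean(h * h))))
        h = toy_draft_step(h, W)
        if post_norm:
            # Feed the normalized state, not the raw one, into the next step.
            h = rms_norm(h)
    return input_rms

rng = np.random.default_rng(0)
dim = 64
W = rng.normal(scale=0.2, size=(dim, dim))  # arbitrary toy weights
h0 = rng.normal(size=dim)

raw = rollout(h0, W, depth=6, post_norm=False)
normed = rollout(h0, W, depth=6, post_norm=True)

print("input RMS per step, raw feedback:      ", [round(r, 2) for r in raw])
print("input RMS per step, post-norm feedback:", [round(r, 2) for r in normed])
```

In the raw variant the input scale changes from step to step, so later steps operate on inputs unlike those seen at the first step. In the post-norm variant every step receives an input of the same scale.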

Integration with vLLM enables deployment with backward compatibility preserved. Benchmarks show 2.03× higher per-user output throughput at concurrency 1, with smaller but still substantial speedups at higher concurrency (1.71× at C=4 and 1.66× at C=16). An open-source draft model for Kimi K2.6 demonstrates these improvements in a production setting.
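
To make the reported speedups easier to compare, the short calculation below converts each one into the fraction of per-token time saved and into the share of the C=1 gain that remains at higher concurrency. It is a sketch that assumes per-token time is the inverse of per-user output throughput. The helper names `latency_reduction` and `gain_retained` are hypothetical, and the only inputs are the three figures quoted above.

```python
# Speedups reported in the text, keyed by concurrency level.
reported_speedup = {1: 2.03, 4: 1.71, 16: 1.66}

def latency_reduction(speedup):
    """Fraction of per-token time saved if throughput rises by `speedup`x."""
    return 1.0 - 1.0 / speedup

def gain_retained(speedup, reference):
    """Share of the reference gain (speedup - 1) that is still present."""
    return (speedup - 1.0) / (reference - 1.0)

summary = {
    c: (
        round(latency_reduction(s), 3),
        round(gain_retained(s, reported_speedup[1]), 3),
    )
    for c, s in reported_speedup.items()
}

for c, (saved, kept) in summary.items():
    print(f"C={c}: {saved:.1%} less time per token, {kept:.0%} of the C=1 gain")
```

Under that assumption, the gain shrinks as concurrency rises, but most of the concurrency-1 improvement is still present at C=16.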