HeadlinesBriefing.com

vLLM Hits 2.2k Tokens/s on H200 GPUs

Hacker News: Front Page

vLLM's new V1 engine delivers 2.2k tokens per second per H200 GPU in multi-node production tests. The open-source project completed its V0 to V1 migration, crediting nearly 2,000 contributors for 950 recent commits. Performance gains come from Dual-batch Overlap (DBO) and DeepEP kernel integration, which reduce communication overhead. Meta, LinkedIn, and Mistral already run vLLM in production, validating its role in high-throughput LLM inference.
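To make the idea behind Dual-batch Overlap concrete, here is a minimal conceptual sketch, not vLLM's actual code: a batch is split into two micro-batches, and while one runs its all-to-all communication the other runs compute, then the roles swap. The function names and timings are illustrative stand-ins.

```python
# Conceptual sketch of Dual-batch Overlap (DBO); hypothetical, not vLLM's code.
# While micro-batch 0 is in its all-to-all dispatch (communication),
# micro-batch 1 runs expert compute, then the roles swap, so the
# communication latency is hidden behind useful work.
import asyncio

async def communicate(mb: int) -> None:
    # Stand-in for the all-to-all that routes tokens to remote experts.
    await asyncio.sleep(0.01)
    print(f"micro-batch {mb}: all-to-all done")

async def compute(mb: int) -> None:
    # Stand-in for the expert MLP forward pass.
    await asyncio.sleep(0.01)
    print(f"micro-batch {mb}: expert compute done")

async def dbo_step() -> None:
    # Phase 1: mb0 communicates while mb1 computes.
    await asyncio.gather(communicate(0), compute(1))
    # Phase 2: roles swap, so neither resource sits idle.
    await asyncio.gather(compute(0), communicate(1))

asyncio.run(dbo_step())
```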

The team optimized Wide-EP (wide expert parallelism) for DeepSeek-style sparse models, which activate only 37B of their 671B parameters per forward pass. Wide-EP combines expert and data parallelism rather than tensor parallelism, maximizing the memory available for KV cache; this frees 34GB per H200 and enables larger batch sizes. Expert Parallel Load Balancing (EPLB) dynamically reshuffles expert weights across ranks to prevent idle ranks when token routing is imbalanced, avoiding costly restarts in live deployments.
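The core of any EPLB-style placement is assigning hot experts so that no rank idles while another saturates. Below is a toy sketch of that idea using a classic greedy makespan heuristic; the expert names and loads are made up, and this is not the vLLM implementation.

```python
# Toy sketch of EPLB-style expert placement; hypothetical, not vLLM's code.
# Greedily assign the hottest experts first, each to the currently
# least-loaded rank, so skewed token routing doesn't leave ranks idle.
import heapq

def balance_experts(load_per_expert: dict[str, int], num_ranks: int) -> list[list[str]]:
    # Min-heap of (total_load, rank_id) tracks the least-loaded rank.
    heap = [(0, r) for r in range(num_ranks)]
    placement: list[list[str]] = [[] for _ in range(num_ranks)]
    for expert, load in sorted(load_per_expert.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)
        placement[rank].append(expert)
        heapq.heappush(heap, (total + load, rank))
    return placement

# Skewed routing: expert e0 receives far more tokens than the rest.
loads = {"e0": 900, "e1": 120, "e2": 110, "e3": 100, "e4": 95, "e5": 90}
print(balance_experts(loads, num_ranks=3))
```

A real balancer would also replicate overloaded experts and account for the cost of moving weights, but the greedy core shows why rebalancing in place beats restarting a live deployment.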

Disaggregated serving separates the prefill and decode phases, which is crucial for expert-parallel workloads where a single request's tokens span multiple ranks. One slow prefill can stall the entire EP group, so decoupling the phases boosts overall throughput. vLLM also supports CUDA graph mode and asynchronous scheduling. These advances align with the llm-d and Dynamo serving stacks, pushing cost per token lower for operators scaling DeepSeek and similar models.
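The routing decision at the heart of disaggregated serving is simple: compute-bound prefill requests go to one worker pool, memory-bound decode steps to another. The sketch below illustrates that split with a hypothetical router; the pool names and placement rule are assumptions, not vLLM's API.

```python
# Minimal sketch of disaggregated-serving routing; hypothetical, not vLLM's API.
# Prefill (compute-bound, processes the whole prompt) and decode
# (memory-bound, one token per step) run on separate pools, so a slow
# prefill cannot stall ranks that are busy decoding.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    generated_tokens: int = 0  # 0 means the prompt has not been prefilled yet

PREFILL_POOL = ["prefill-0", "prefill-1"]
DECODE_POOL = ["decode-0", "decode-1", "decode-2"]

def route(req: Request) -> str:
    pool = PREFILL_POOL if req.generated_tokens == 0 else DECODE_POOL
    # Naive placement; a real router would track KV-cache transfer and load.
    return pool[hash((req.prompt_tokens, req.generated_tokens)) % len(pool)]

print(route(Request(prompt_tokens=4096)))                        # fresh request -> prefill pool
print(route(Request(prompt_tokens=4096, generated_tokens=17)))   # in-flight generation -> decode pool
```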