HeadlinesBriefing.com

DeepEP kernel powers wide expert parallelism for MoE inference

Hacker News

Serving large language models takes dozens of GPUs, but inference isn’t embarrassingly parallel. To keep those accelerators working together, systems combine Tensor, Pipeline, Context and Expert Parallelism. For Mixture‑of‑Experts (MoE) models, the dominant strategy is wide Expert Parallelism (EP), which routes each token to multiple experts spread across the cluster. DeepSeek’s production deployment, highlighted in vLLM, demonstrates the approach, hitting 2.2k tokens per second per H200 GPU.

The EP kernel handles routing decisions that are only known at runtime. Each token starts on a single data‑parallel rank, while the experts are scattered across ranks, so most token‑expert pairs sit on different GPUs. The kernel gathers activations onto the ranks that host the chosen experts, launches a grouped GEMM per expert, then returns the results to each token’s home rank, using a ragged buffer for throughput or a fixed‑size buffer for low latency. DeepEP moves data over RDMA when experts reside on different nodes.
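The dispatch, compute and combine steps can be illustrated with a toy, single-process sketch. It is not DeepEP's API: the function name `dispatch_and_combine`, the scalar "activations", the simple `expert // experts_per_rank` placement rule and the lambda experts are all assumptions made for illustration, and a real kernel would move tensors between GPUs rather than loop over Python lists.

```python
from collections import defaultdict


def dispatch_and_combine(tokens, topk_experts, topk_weights,
                         expert_fns, experts_per_rank, src_rank):
    """Toy model of EP: dispatch -> per-expert compute -> combine.

    tokens:          list of floats (stand-ins for activation vectors)
    topk_experts[i]: expert ids chosen by the router for token i
    topk_weights[i]: matching gate weights for token i
    Assumption: expert e lives on rank e // experts_per_rank.
    """
    # Dispatch: group (token, slot) pairs by expert, a ragged per-expert buffer.
    per_expert = defaultdict(list)
    remote_pairs = 0
    for t, experts in enumerate(topk_experts):
        for slot, e in enumerate(experts):
            per_expert[e].append((t, slot))
            if e // experts_per_rank != src_rank:
                remote_pairs += 1  # this pair would cross GPUs

    # Compute: one batched call per expert (stand-in for a grouped GEMM).
    out = [0.0] * len(tokens)
    for e, entries in per_expert.items():
        results = [expert_fns[e](tokens[t]) for t, _ in entries]
        # Combine: weighted sum back into each token's home position.
        for (t, slot), y in zip(entries, results):
            out[t] += topk_weights[t][slot] * y
    return out, remote_pairs


# Four hypothetical experts that just scale their input; two experts per rank.
expert_fns = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 10.0]
topk_experts = [[0, 3], [1, 2]]
topk_weights = [[0.5, 0.5], [0.25, 0.75]]

out, remote = dispatch_and_combine(
    tokens, topk_experts, topk_weights, expert_fns,
    experts_per_rank=2, src_rank=0,
)
print(out, remote)  # [2.5, 27.5] 2
```

Even in this tiny case half of the token‑expert pairs land on the other rank, which is why the communication path matters as much as the GEMM itself.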

DeepEP, the library that shaped modern EP kernels, handles communication differently in the two inference phases. During prefill, large batches allow transfers to overlap with compute, so the system first runs a coordination pass that learns how many tokens each destination will receive, then allocates just enough memory. At decode time, latency dominates, so a pre‑padded buffer with known addresses eliminates the coordination step at the cost of unused slots. This design lets MoE inference scale to thousands of tokens per second per GPU.
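The trade-off between the two buffer layouts can be sketched in a few lines. This is an illustration under stated assumptions, not DeepEP code: `ragged_layout`, `padded_slot` and the cap of 8 tokens per expert are hypothetical names and values chosen for the example.

```python
def ragged_layout(counts):
    """Prefill-style layout.

    counts[r] is the number of tokens bound for destination r, as reported
    by a coordination pass. Returns each destination's start offset
    (an exclusive prefix sum) and the total number of slots needed.
    """
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total


def padded_slot(expert, index, max_tokens_per_expert):
    """Decode-style layout.

    Every expert owns a fixed-size region, so a sender can compute the
    slot address without knowing what any other sender will write.
    """
    if index >= max_tokens_per_expert:
        raise ValueError("token would overflow the fixed-size region")
    return expert * max_tokens_per_expert + index


counts = [3, 0, 5, 1]                  # tokens per destination this step
offsets, total = ragged_layout(counts)
print(offsets, total)                  # [0, 3, 3, 8] 9

MAX_TOKENS = 8                         # hypothetical per-expert cap
print(len(counts) * MAX_TOKENS)        # 32 slots reserved up front
print(padded_slot(2, 1, MAX_TOKENS))   # 17
```

The ragged layout uses 9 slots but needs the counts before anything can be written; the padded layout reserves 32 slots and lets writes start immediately.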