HeadlinesBriefing.com

AMD MI355X Delivers 2626 Tokens/Second at Half Blackwell Cost

Hacker News

Wafer reached 2626 tok/s per node running GLM-5.2 on AMD's Instinct MI355X, performance comparable to NVIDIA's Blackwell systems at less than half the cost. The setup used sglang with MXFP4 quantization via AMD Quark and, with speculative decoding optimizations, hit 213 tok/s on long-context workloads. AMD's hardware costs roughly 2.75x less than comparable NVIDIA GPUs with similar specs, which makes it an attractive alternative as inference demand outpaces Blackwell supply.
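For readers unfamiliar with MXFP4, the idea can be sketched in a few lines. This is a simplified illustration of the microscaling scheme (blocks of 32 values sharing one power-of-two scale, each element stored as a 4-bit E2M1 float), not the AMD Quark implementation. The function names are hypothetical, ties round toward the smaller magnitude for brevity, and the range limits of the 8-bit shared exponent are ignored.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4
ELEM_EMAX = 2    # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exponent, quantized_elements) for one block of floats.

    Hypothetical helper for illustration; real kernels pack the results
    into 4-bit codes plus one 8-bit exponent per block.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Choose the scale so the largest value lands in the top E2M1 binade.
    shared_exp = math.floor(math.log2(max_abs)) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        scaled = min(abs(v) / scale, FP4_E2M1_VALUES[-1])  # clamp to 6.0
        # Simplification: ties go to the smaller magnitude.
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - scaled))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]
```

In this sketch, a block whose largest magnitude is 12.0 gets a shared exponent of 1 (scale 2), so 12.0 is stored as the element 6.0 and recovered exactly. Values that fall between representable points are rounded, which is where the precision loss of a 4-bit format comes from.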

The team hit significant friction with AMD's ROCm stack, which lacks CUDA's day-0 support and software maturity. Unlike NVIDIA's optimized paths, sglang's ROCm image required manual fixes for speculative decoding, including kernel guards and quantization lookup adjustments to handle GLM-5.2's mixed-precision architecture. Framework-level issues like these typically delay AMD deployments by weeks.

Performance gains came from switching the tensor parallelism configuration and tuning MoE kernels specifically for the fp4 path. The default sglang image fell back to slow Fly DSL heuristics for GLM-5.2's fp4 MoE implementation. Once kernel selection was tuned for the model's specific shapes (6144 dimension, 2048 inter, E=256), throughput jumped significantly.

This work shows that the AMD MI355X can compete on performance per dollar for frontier model inference, though the software optimization overhead remains substantial. The gap is closing as AMD's ecosystem matures, which suggests NVIDIA's CUDA advantage is eroding in real time rather than being fundamentally insurmountable.
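The performance-per-dollar claim reduces to simple arithmetic, sketched below. The article gives no absolute prices, so costs are expressed in normalized placeholder units. "Comparable" throughput is read as equal and "over 2x lower cost" as exactly 2x, both of which are assumptions rather than reported figures.

```python
def tokens_per_dollar(tokens_per_second, cost_per_hour):
    """Tokens generated per unit of currency at a given hourly node cost."""
    return tokens_per_second * 3600 / cost_per_hour


def perf_per_dollar_advantage(throughput_ratio, cost_ratio):
    """Relative tokens-per-dollar of system A over system B.

    throughput_ratio: A's throughput divided by B's.
    cost_ratio: B's cost divided by A's.
    """
    return throughput_ratio * cost_ratio


# Placeholder inputs: equal throughput (assumed) and a 2x cost gap
# (the lower bound of the article's "over 2x").
advantage = perf_per_dollar_advantage(throughput_ratio=1.0, cost_ratio=2.0)
```

With these inputs the advantage is 2.0. Any shortfall in throughput or any extra engineering cost spent on tuning would shrink it proportionally.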