HeadlinesBriefing.com

Doubleword makes MI300X run DeepSeek V4 Flash amid GPU shortage

Hacker News

Doubleword is building an inference cloud aimed at high‑volume workloads, but it must navigate today’s GPU scarcity. AMD unveiled the MI300X in December 2023 as its answer to NVIDIA’s H100, offering 192 GB of HBM3 and comparable FP8 throughput at roughly half the list price. Rental rates already undercut equivalent NVIDIA parts.

The hardware advantage stalls in software: on MI300X, AMD’s stack still uses a niche FP8 dialect called fnuz, which has no negative zero and no infinities. The DeepSeek‑V4‑Flash path in vLLM expects the Open Compute Project (OCP) FP8 format, whose exponent bias differs by one, so the same byte decodes to a value that is off by a factor of two. Doubleword’s engineers patched the decoder and quantisation paths to align the tensors.
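The factor-of-two mismatch can be checked by hand. The sketch below decodes a single byte under both E4M3 layouts, assuming the standard definitions (OCP E4M3 with exponent bias 7, fnuz E4M3 with bias 8 and 0x80 reserved for NaN). The function names are illustrative and are not taken from vLLM or from Doubleword's patch.

```python
import math


def decode_e4m3_ocp(byte: int) -> float:
    """Decode one byte as OCP FP8 E4M3 (exponent bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # S.1111.111 is the only NaN pattern
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - 7)  # subnormal or signed zero
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_e4m3_fnuz(byte: int) -> float:
    """Decode one byte as fnuz FP8 E4M3 (exponent bias 8, no -0, no infinities)."""
    if byte == 0x80:
        return math.nan  # the would-be negative zero encodes NaN instead
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - 8)  # subnormal or zero
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 8)


# The same bit pattern read two ways.
print(decode_e4m3_ocp(0x38), decode_e4m3_fnuz(0x38))
```

Under these definitions, every byte that is finite in both formats decodes to exactly half its OCP value when read as fnuz, and 0x80 changes meaning from negative zero to NaN. A loader that reinterprets OCP weights as fnuz therefore has to compensate somewhere, for example in the dequantisation scale, though the summary does not say which approach Doubleword took.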

DeepSeek‑V4’s sparse attention relies on several tuned kernels that AMD supplies through AITER, its counterpart to the tuned libraries used on NVIDIA hardware such as cuBLAS and FlashAttention. AITER coverage targets newer CDNA4 GPUs, leaving the CDNA3‑based MI300X without optimized paths for paged MQA logits and sparse MLA prefill. The team added ROCm helpers and fell back to Triton where AITER failed.
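The fallback arrangement described above amounts to an ordered dispatch: try the tuned kernel first and drop to a portable one when the tuned path does not cover the GPU. The sketch below illustrates that pattern only. Every name in it (`KernelUnavailable`, `tuned_logits`, `portable_logits`, `dispatch`, the architecture strings) is hypothetical, and none of it reflects the actual AITER, Triton, or vLLM interfaces.

```python
class KernelUnavailable(RuntimeError):
    """Signals that a backend has no kernel for the current GPU (hypothetical)."""


def tuned_logits(arch, q, k):
    # Stand-in for a vendor-tuned kernel that only covers newer hardware.
    if arch != "cdna4":
        raise KernelUnavailable(f"no tuned path for {arch}")
    return sum(a * b for a, b in zip(q, k))


def portable_logits(arch, q, k):
    # Stand-in for a portable fallback, for example a Triton kernel.
    return sum(a * b for a, b in zip(q, k))


def dispatch(backends, arch, *args):
    """Try each (name, kernel) pair in order; return the first that succeeds."""
    failures = []
    for name, kernel in backends:
        try:
            return name, kernel(arch, *args)
        except KernelUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise KernelUnavailable("; ".join(failures))


BACKENDS = [("tuned", tuned_logits), ("portable", portable_logits)]

# On a CDNA3 part the tuned path is skipped and the portable one runs.
print(dispatch(BACKENDS, "cdna3", [1.0, 2.0], [3.0, 4.0]))
```

Returning the backend name alongside the result makes it easy to log which path actually ran, which helps when profiling shows an unexpectedly slow kernel.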

To cut Python launch overhead, the engineers captured the decode loop in HIP graphs, AMD’s analogue of CUDA graphs. Static tensors replace ragged allocations so that every replay sees the same shapes and buffers. Initial profiling shows the sparse MLA and MoE paths dominate runtime, while matmul costs remain modest. The work shows MI300X can host DeepSeek‑V4‑Flash with careful kernel tuning.
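A captured graph replays the same work on the same buffers, so a ragged batch has to be fitted into a fixed shape before replay. The sketch below shows one common way to do that, assuming a small set of pre-chosen batch sizes and a buffer that is allocated once and overwritten in place. It is plain Python with hypothetical names (`CAPTURE_SIZES`, `pick_bucket`, `StaticBatch`) and does not call HIP or reproduce Doubleword's code.

```python
import bisect

# Hypothetical batch sizes for which a graph would have been captured.
CAPTURE_SIZES = [1, 2, 4, 8, 16]


def pick_bucket(batch_size, sizes=CAPTURE_SIZES):
    """Return the smallest captured size that can hold batch_size requests."""
    i = bisect.bisect_left(sizes, batch_size)
    if i == len(sizes):
        raise ValueError(f"batch of {batch_size} exceeds largest captured size")
    return sizes[i]


class StaticBatch:
    """Fixed-capacity token buffer that is reused across decode steps."""

    def __init__(self, capacity, pad_id=0):
        self.pad_id = pad_id
        self.tokens = [pad_id] * capacity  # allocated once, never resized

    def load(self, token_ids):
        """Copy a ragged batch into the buffer in place and pad the rest."""
        n = len(token_ids)
        if n > len(self.tokens):
            raise ValueError("batch larger than buffer capacity")
        self.tokens[:n] = token_ids
        self.tokens[n:] = [self.pad_id] * (len(self.tokens) - n)
        return n  # number of real entries; the remainder is padding


buf = StaticBatch(pick_bucket(3))
live = buf.load([7, 8, 9])
print(buf.tokens, live)
```

The trade-off is some wasted compute on padded slots in exchange for replaying a fixed graph instead of launching each kernel from Python.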