HeadlinesBriefing.com

OpenAI's MRC Protocol Boosts Supercomputer Networking for AI Training

Hacker News

OpenAI has released MRC (Multipath Reliable Connection), a new networking protocol developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The protocol dramatically improves GPU-to-GPU data transfer in large AI training clusters. MRC is already deployed across OpenAI's NVIDIA GB200 supercomputers and Microsoft Fairwater systems, as well as on Oracle Cloud Infrastructure in Abilene, Texas.

The protocol tackles a critical bottleneck in frontier AI training: network congestion and failures can ripple through entire jobs, leaving GPUs idle. MRC enables multi-plane networks that connect over 100,000 GPUs using just two tiers of switches, cutting power consumption and failure points. It sprays packets across hundreds of paths simultaneously, preventing congestion hotspots that typically slow down synchronous training jobs.
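The packet-spraying idea can be illustrated with a minimal sketch. This is not OpenAI's implementation; the function names and path counts are hypothetical. It contrasts classic per-flow ECMP, which hashes an entire flow onto one path, with per-packet spraying, which assigns each packet a path independently so no single link becomes a hotspot:

```python
import random

def ecmp_path(flow_id: int, num_paths: int) -> int:
    """Per-flow ECMP: every packet of a flow takes the same hashed path."""
    return hash(flow_id) % num_paths

def spray_paths(num_packets: int, num_paths: int, seed: int = 0) -> list[int]:
    """Per-packet spraying: each packet picks a path independently,
    spreading one flow's load across many equal-cost paths."""
    rng = random.Random(seed)
    return [rng.randrange(num_paths) for _ in range(num_packets)]

# A 1,000-packet flow over 256 candidate paths:
sprayed = spray_paths(1000, 256)
ecmp = {ecmp_path(42, 256) for _ in range(1000)}

print("distinct paths under spraying:", len(set(sprayed)))
print("distinct paths under ECMP:", len(ecmp))
```

Under spraying the flow touches most of the 256 paths; under ECMP it is pinned to exactly one, which is why a congested path stalls the whole flow.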

OpenAI released MRC through the Open Compute Project to establish shared infrastructure standards that benefit the broader AI ecosystem. The protocol extends RDMA over Converged Ethernet (RoCE) with SRv6-based source routing, building on techniques from the Ultra Ethernet Consortium. A detailed paper, "Resilient AI Supercomputer Networking using MRC and SRv6," accompanies the specification release.
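SRv6-style source routing means the sender, not each switch, chooses the path by encoding an explicit segment list in the packet. The sketch below mimics that mechanic in miniature (the hop names and classes are illustrative, not part of the MRC specification); as in SRv6, the segment list is stored in reverse order and a `segments_left` pointer marks the active segment:

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    payload: str
    segments: list[str]              # explicit path, last hop listed first (SRv6 convention)
    segments_left: int = field(init=False)

    def __post_init__(self):
        self.segments_left = len(self.segments) - 1

    def active_segment(self) -> str:
        return self.segments[self.segments_left]

def forward(pkt: Packet) -> str:
    """Advance to the next segment, as an SRv6 endpoint would."""
    if pkt.segments_left > 0:
        pkt.segments_left -= 1
    return pkt.active_segment()

# The source pins this packet to one specific path through the fabric:
pkt = Packet("gradients", segments=["gpu-b", "spine-7", "leaf-3"])
hops = [pkt.active_segment()]
while pkt.segments_left > 0:
    hops.append(forward(pkt))
print(hops)  # ['leaf-3', 'spine-7', 'gpu-b']
```

Because the source controls the segment list, it can steer successive packets onto different paths, which is what makes path spraying and fast rerouting around failures possible.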

This networking advancement directly supports OpenAI's Stargate initiative, which aims to build massive AI compute infrastructure. The protocol has already been used to train multiple OpenAI models on NVIDIA and Broadcom hardware, demonstrating practical, real-world performance improvements for frontier AI systems.