HeadlinesBriefing.com

OpenAI's MRC Protocol Revolutionizes 131K-GPU Network Design

Towards Data Science

OpenAI and partners have deployed a radical networking protocol called MRC across their 131,000-GPU training infrastructure, challenging three decades of data center networking consensus. The Multipath Reliable Connection protocol, released through the Open Compute Project, eliminates traditional Layer 3 routing entirely—no OSPF, BGP, or dynamic forwarding state exists in the fabric.

Traditional networks route each connection through a single path, creating bottlenecks when links fail. MRC instead splits each 800 Gb/s NIC into eight 100 Gb/s links across independent network planes, spraying packets randomly across hundreds of paths. Rather than guaranteeing lossless, in-order delivery, the protocol deliberately tolerates packet loss and recovers through retransmission. The design emerged from necessity: at 100,000+ GPU scale, a single straggler can stall an entire training step, costing roughly $300,000 per hour in compute expenses.
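The spraying-plus-retransmission idea can be sketched as a toy simulation. This is not MRC itself: the eight-plane split comes from the article, but the sequence numbers, the choice of failed plane, and the retransmit-over-surviving-planes policy are illustrative assumptions, since the article does not specify the retransmission mechanism.

```python
import random

PLANES = list(range(8))   # eight independent planes per 800 Gb/s NIC (per the article)
DEAD = 3                  # hypothetical failed plane: assume all its packets are dropped

def spray(seqs, planes):
    """Per-packet spraying: each sequence-numbered packet picks a plane at random."""
    return [(seq, random.choice(planes)) for seq in seqs]

# First pass: spray 1,000 packets across all planes; packets on the dead plane are lost.
first = spray(range(1000), PLANES)
delivered = {seq for seq, plane in first if plane != DEAD}
lost = [seq for seq, plane in first if plane == DEAD]

# Retransmit the losses over the surviving planes (a stand-in for MRC's
# actual loss-recovery logic, whose details the article omits).
alive = [p for p in PLANES if p != DEAD]
delivered |= {seq for seq, plane in spray(lost, alive)}

assert delivered == set(range(1000))  # every packet arrives despite a whole dead plane
```

The point of the sketch is the failure model: because each packet independently picks a path, a dead plane degrades one-eighth of first-pass traffic instead of severing any single connection outright.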

The multi-plane architecture reduces a conventional three-tier network to just two tiers while supporting more than double the GPUs with fewer switches and optics. Each plane operates independently, so a link failure affects only about 0.4% of fabric bandwidth rather than roughly 3%. This eliminates the failure cascades that can crash traditional training jobs when optical transceivers glitch across thousands of links.
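The blast-radius arithmetic behind those percentages can be illustrated with a simple ratio. Note the link counts below are hypothetical, chosen only to reproduce the article's 3% and 0.4% figures; the article does not state the actual failure-domain sizes.

```python
def bandwidth_lost(failed_links, links_sharing_traffic):
    """Fraction of aggregate bandwidth lost when links fail, assuming traffic
    is sprayed evenly across `links_sharing_traffic` parallel links."""
    return failed_links / links_sharing_traffic

# Hypothetical single-plane fabric: one failure in a ~32-link sharing group.
print(f"{bandwidth_lost(1, 32):.1%}")   # 3.1%

# With eight independent planes, the same failure stays inside one plane,
# so it touches roughly 1/8 as much of the sprayed traffic.
print(f"{bandwidth_lost(1, 256):.1%}")  # 0.4%
```

The design choice this captures: isolating planes from one another does not prevent failures, it shrinks the share of total bandwidth any single failure can touch.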

MRC's counterintuitive decisions—accepting loss, abandoning deterministic routing, eliminating control planes—work because they address the fundamental problem of tail latency at extreme scale. For infrastructure teams building next-generation AI clusters, MRC demonstrates that conventional networking wisdom breaks down when coordinating hundreds of thousands of GPUs.