HeadlinesBriefing.com

Gaudi Cloud Performance Breakthrough: Peer Direct Solves Host Memory Bottleneck

Towards Data Science

When Intel's Habana Labs brought Gaudi accelerators to Amazon's EC2 DL1 instances, multi-node training performance dropped sharply. Models trained across multiple nodes lost up to 50% of their performance because the cloud network topology forced all inter-node data through host memory, creating a bottleneck that undercut Gaudi's core design and threatened the viability of the deployment.

Gaudi's architecture integrates NICs directly on the chip: ten 100 Gbps interfaces supporting RDMA over Converged Ethernet (RoCE v2), which lets devices access each other's memory directly. Amazon's cloud deployment, however, used standard host NICs instead of Gaudi's built-in networking for cost and infrastructure reasons. As a result, data had to be copied into host DRAM, processed by the CPU, and sent as TCP/IP traffic over the host NICs before reaching the destination device. The added latency and CPU overhead severely limited distributed training scalability.
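To put the on-chip networking figure in perspective, the short calculation below converts "ten 100 Gbps interfaces" into an aggregate raw line rate per device. It is back-of-the-envelope arithmetic only: it ignores protocol overhead and how ports are actually allocated, and the helper name `ideal_transfer_seconds` is made up for illustration.

```python
# Aggregate raw line rate of Gaudi's integrated NICs, using the figures
# quoted in the article (ten ports at 100 Gbps each). Real throughput will
# be lower; this ignores protocol overhead and port allocation.
PORTS = 10
GBPS_PER_PORT = 100

aggregate_gbps = PORTS * GBPS_PER_PORT        # gigabits per second
aggregate_gbytes_per_s = aggregate_gbps / 8   # gigabytes per second


def ideal_transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Hypothetical helper: best-case time to move num_bytes over a link."""
    return (num_bytes * 8) / (link_gbps * 1e9)


if __name__ == "__main__":
    print(f"Aggregate line rate: {aggregate_gbps} Gbps "
          f"({aggregate_gbytes_per_s} GB/s)")
    print(f"Best case for 1 GB: "
          f"{ideal_transfer_seconds(1e9, aggregate_gbps) * 1e3:.1f} ms")
```

Any path that detours through host DRAM and the host CPU competes against this best case, which is why the detour was so costly for multi-node training.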

The solution, Peer Direct, delivers RDMA-like performance using AWS Elastic Fabric Adapter (EFA) and libfabric. The team built a memory registration system on the Linux kernel's DMA-BUF framework that gives libfabric direct access to Gaudi's high-bandwidth memory. An LRU cache for memory registrations eliminated most of the registration overhead. The result: a 1.5x to 2x throughput increase for collective operations, with up to 1.76x better performance on larger messages. The CPU overhead that had previously bottlenecked training was eliminated.
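The idea behind an LRU cache for memory registrations can be sketched as follows. This is a minimal illustration in Python, not the team's implementation: the class name `RegistrationCache` and the `register` / `deregister` callbacks are hypothetical stand-ins for whatever expensive calls actually pin a buffer and hand it to the network stack.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class RegistrationCache:
    """LRU cache mapping (address, length) to a registration handle.

    `register` and `deregister` are hypothetical callbacks standing in for
    the real (expensive) memory registration and release operations.
    """

    def __init__(self, capacity: int,
                 register: Callable[[int, int], Any],
                 deregister: Callable[[Any], None]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._register = register
        self._deregister = deregister
        self._entries: "OrderedDict[Tuple[int, int], Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, addr: int, length: int) -> Any:
        key = (addr, length)
        if key in self._entries:
            # Cache hit: mark as most recently used, skip registration.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]

        # Cache miss: pay the registration cost once and remember it.
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[key] = handle

        # Evict the least recently used registration if over capacity.
        if len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._deregister(evicted)
        return handle
```

Training loops tend to reuse the same buffers on every iteration, so after the first pass most lookups are hits and the registration cost is paid only once per buffer.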