
ZeRO & FSDP: How AI Training Scales Across Multiple GPUs

Towards Data Science

Distributed training with Distributed Data Parallel (DDP) solves the throughput problem but creates memory redundancy: every GPU holds a complete copy of the model parameters, gradients, and optimizer states, making it impossible to train massive models like GPT-3 on standard hardware. ZeRO (Zero Redundancy Optimizer) eliminates this redundancy through three optimization stages that partition different components across the GPUs.
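To see where the commonly cited 16 bytes per parameter come from, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam (fp16 parameters and gradients plus fp32 master weights, momentum, and variance); exact figures vary by setup.

```python
# Rough per-GPU memory for plain DDP, where every rank holds a full replica.
# Assumes mixed precision with Adam: 2 bytes fp16 params + 2 bytes fp16 grads
# + 12 bytes fp32 optimizer state (master weights, momentum, variance).
def ddp_memory_gb(num_params: float) -> float:
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / 1e9

print(ddp_memory_gb(7e9))  # ~112 GB per GPU, no matter how many GPUs you add
```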

ZeRO-1 partitions only the optimizer states, shrinking that component to roughly 1/N of its original size, where N is the number of GPUs. ZeRO-2 goes further by partitioning both optimizer states and gradients, using reduce-scatter instead of all-reduce to save communication bandwidth. ZeRO-3 partitions everything: optimizer states, gradients, and the model parameters themselves, allowing training of models far larger than any single GPU could hold. Each GPU stores only its assigned partition and reassembles what it needs on the fly through collective operations: parameters are all-gathered just before they are used, and gradients are reduce-scattered after the backward pass.
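A small sketch of how each stage changes the per-GPU footprint, using the same assumed byte counts as above. This is an accounting illustration under those assumptions, not the actual allocator of any library.

```python
# Approximate per-GPU memory under each ZeRO stage, assuming 2 + 2 + 12 bytes
# per parameter (fp16 params, fp16 grads, fp32 Adam state) and an even split
# across `world_size` ranks.
def zero_memory_gb(num_params: float, world_size: int, stage: int) -> float:
    params, grads, optim = 2.0, 2.0, 12.0        # bytes per parameter
    if stage >= 1:
        optim /= world_size                      # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= world_size                      # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= world_size                     # ZeRO-3: also shard parameters
    return num_params * (params + grads + optim) / 1e9

for stage in range(4):
    print(stage, zero_memory_gb(7e9, 8, stage))
# 0: 112.0, 1: 38.5, 2: 26.25, 3: 14.0  (GB per GPU)
```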

For a 7-billion-parameter model trained with the Adam optimizer, ZeRO-3 cuts per-GPU memory from 112 GB to just 14 GB across 8 GPUs; at larger GPU counts, the same partitioning puts trillion-parameter models within reach of commodity hardware. In PyTorch the approach is implemented as Fully Sharded Data Parallel (FSDP), which combines ZeRO's memory efficiency with DDP's computational parallelism.
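A minimal FSDP sketch along those lines, assuming PyTorch 2.x on a CUDA machine and a launch via torchrun so the process-group environment variables are set; the toy Transformer and hyperparameters are placeholders, not part of the original article.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Launch with e.g.: torchrun --nproc_per_node=8 train_fsdp.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in model; FULL_SHARD is the ZeRO-3-style strategy (parameters,
# gradients, and optimizer states all sharded across ranks).
model = torch.nn.Transformer(d_model=512, nhead=8).cuda()
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# Create the optimizer after wrapping so it only sees this rank's shard.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = torch.randn(10, 32, 512, device="cuda")
tgt = torch.randn(10, 32, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()        # gradients are reduce-scattered across ranks
optimizer.step()       # each rank updates only its own parameter shard

dist.destroy_process_group()
```

Swapping in ShardingStrategy.SHARD_GRAD_OP gives ZeRO-2-style behavior, sharding gradients and optimizer states while keeping parameters replicated.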