
NanoGPT Slowrun Achieves 10‑Fold Data Efficiency

Hacker News

NanoGPT Slowrun has pushed an 18B-parameter model ensemble to 10x data efficiency: trained on just 100M tokens, it matches performance that a standard language-model baseline would need roughly 1B tokens to reach. The team reports the result after only a few weeks of experimentation, challenging prevailing scaling laws that tie data and compute together.

Traditional scaling studies such as Chinchilla, which prescribe roughly 20 training tokens per parameter, would call for a 5M-parameter model on 100M tokens; NanoGPT's approach overshoots that by 3,600-fold. By training an ensemble of nine models totaling 18B parameters, the researchers demonstrate that compute can stand in for data, driving performance gains while keeping data demands low.
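As a quick sanity check on those numbers, here is the arithmetic in Python (the 20-tokens-per-parameter ratio is the commonly cited Chinchilla heuristic; the parameter counts come from the article):

```python
# Chinchilla's compute-optimal heuristic: ~20 training tokens per parameter.
TOKENS = 100_000_000                # 100M-token training budget
chinchilla_params = TOKENS / 20     # ~5M parameters would be "optimal"

ensemble_params = 18_000_000_000    # nine models, 18B parameters total

print(f"Chinchilla-optimal size: {chinchilla_params / 1e6:.0f}M parameters")
print(f"Oversizing factor: {ensemble_params / chinchilla_params:,.0f}x")  # 3,600x
```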

Key to the leap is a suite of architectural changes: exclusive self‑attention removes the self‑value projection; looped transformers reapply layers 15‑24 four times; and chain distillation trains models sequentially, each learning from its predecessor. Heavy regularization—weight decay up to 1.6 and dropout 0.1—further sharpens generalization.
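The article names these techniques without defining them, so the sketch below is one plausible PyTorch reading, not the team's code: interpreting "exclusive self-attention" as masking each token's attention to its own position, the 0-based indices for the looped span, and the loss mix for chain distillation are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def exclusive_causal_mask(seq_len: int) -> torch.Tensor:
    """Causal attention mask that also blocks each position from attending
    to itself, so a token's own value never reaches its output (one reading
    of "exclusive self-attention"; position 0 then attends to nothing and
    needs special handling, e.g. an attention sink, in a real model)."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask.fill_diagonal_(False)
    return mask

def looped_forward(blocks, x, loop_start=14, loop_end=24, n_loops=4):
    """Reapply a contiguous span of transformer blocks several times.
    Defaults mirror the article's "layers 15-24, four times" (0-based)."""
    for block in blocks[:loop_start]:
        x = block(x)
    for _ in range(n_loops):
        for block in blocks[loop_start:loop_end]:
            x = block(x)
    for block in blocks[loop_end:]:
        x = block(x)
    return x

def chain_distillation_loss(student_logits, teacher_logits, targets,
                            alpha=0.5, temperature=1.0):
    """Loss for model k trained against model k-1 (its predecessor in the
    chain) plus the data labels; the 50/50 mix and temperature are
    illustrative defaults, not reported values. Logits: (batch, seq, vocab)."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl
```

Whatever the exact recipe, layer looping adds effective depth without adding parameters, which is one way to spend extra compute against a fixed data budget.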

The findings suggest that ensembling, aggressive regularization, and iterative layer reuse can break the data-compute bottleneck. With 18B total parameters, NanoGPT Slowrun already outperforms a single 2.7B model trained on the same data, evidence that smarter architecture can beat raw scale when data is scarce.
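The piece does not say how the nine members' predictions are combined at inference; averaging their predicted probabilities in log space is one standard choice, sketched here under that assumption.

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_log_probs(models, input_ids):
    """Combine ensemble members by averaging their next-token probabilities
    in log space (log-mean-exp); plain logit averaging would be an equally
    plausible reading. Each model maps input_ids to (batch, seq, vocab) logits."""
    stacked = torch.stack(
        [F.log_softmax(model(input_ids), dim=-1) for model in models]
    )
    return torch.logsumexp(stacked, dim=0) - math.log(len(models))
```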