HeadlinesBriefing.com

Geometric fix narrows gap for small language models

Hacker News

Researchers at ICML 2026 identified a geometric flaw in small language models, which they call embedding condensation. As tokens pass through Transformer layers, their vectors converge into a narrow cone, raising pairwise cosine similarity. The effect appears as early as initialization, worsens in models with fewer parameters, and persists across multiple benchmarks.
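
To make the cone picture concrete, here is a minimal sketch of one way condensation could be measured: the mean cosine similarity over all pairs of token vectors. The function names and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical toy data: four well-spread directions vs. four directions
# packed into a narrow cone around the first axis.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
condensed = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.05]]

spread_score = mean_pairwise_cosine(spread)
condensed_score = mean_pairwise_cosine(condensed)
print(f"spread: {spread_score:.3f}, condensed: {condensed_score:.3f}")
```

A score near 1 means the vectors point in almost the same direction, which is the condensed regime the article describes.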

To combat this collapse, the authors propose a regularizer named dispersion loss that pushes embeddings apart on the unit hypersphere. Variants such as ℓ2‑repel and orthogonalization were tested, each encouraging a more uniform angular distribution. Experiments on GPT‑2‑style families show the loss restores diversity, narrowing the performance gap without adding parameters and improving downstream task accuracy.
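
The article does not give the formulas for these variants, so the sketch below uses two plausible stand-ins: a repulsion term based on squared ℓ2 distance between unit-normalized vectors, and an orthogonality term that penalizes squared cosine similarity. Both definitions, the `temperature` parameter, and the toy inputs are assumptions for illustration and may differ from the paper's versions.

```python
import math
from itertools import combinations

def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_repel_loss(embeddings, temperature=1.0):
    """Assumed form: log of the mean Gaussian kernel over pairwise distances.

    Lower values mean the unit vectors sit farther apart.
    """
    unit = [normalize(v) for v in embeddings]
    terms = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / temperature)
        for u, v in combinations(unit, 2)
    ]
    return math.log(sum(terms) / len(terms))

def orthogonality_loss(embeddings):
    """Assumed form: mean squared cosine similarity over all pairs.

    Zero when every pair of vectors is orthogonal.
    """
    unit = [normalize(v) for v in embeddings]
    squared = [
        sum(a * b for a, b in zip(u, v)) ** 2
        for u, v in combinations(unit, 2)
    ]
    return sum(squared) / len(squared)

# Hypothetical toy inputs: an orthogonal set and a narrow cone.
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
condensed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.05, 0.05]]

for name, vectors in (("spread", spread), ("condensed", condensed)):
    print(name, l2_repel_loss(vectors), orthogonality_loss(vectors))
```

Under either stand-in the condensed set scores higher than the spread set, so minimizing the term pushes vectors toward a more uniform spread of directions.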

Controlled experiments that varied only the MLP dimension while fixing layers, embedding size, and data still exhibited the size‑dependent condensation trend. Knowledge distillation from larger teachers failed to transfer resistance, indicating the phenomenon stems from intrinsic geometry rather than learned weights, so simple teacher‑student pipelines cannot remedy the collapse. These findings reinforce the need for explicit regularization.
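
As a rough illustration of that controlled setup, the sketch below counts Transformer block weights while only the MLP width changes. It uses the standard approximation of 4·d² attention weights and 2·d·d_mlp MLP weights per layer, ignoring biases, layer norms, and embeddings. The specific layer count and dimensions are hypothetical and are not the configurations used in the study.

```python
def block_params(n_layers, d_model, d_mlp):
    """Approximate weight count of the Transformer blocks.

    Attention: Q, K, V and output projections, each d_model x d_model.
    MLP: one up-projection and one down-projection of size d_model x d_mlp.
    Biases, layer norms, and embedding tables are ignored.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * d_mlp
    return n_layers * (attention + mlp)

# Hypothetical sweep: depth and embedding size fixed, MLP width varied.
N_LAYERS, D_MODEL = 12, 768
for d_mlp in (768, 1536, 3072):
    total = block_params(N_LAYERS, D_MODEL, d_mlp)
    print(f"d_mlp={d_mlp}: {total:,} block weights")
```

Sweeping a single dimension like this changes model size while leaving depth and embedding width untouched, which is what lets size be isolated as the variable.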

Embedding condensation limits representational capacity, which helps explain why larger models often outperform their smaller counterparts beyond what sheer parameter count would suggest. The study demonstrates that geometric regularization can yield more expressive small models, offering a practical tool for developers constrained by compute budgets. Applying dispersion loss during mid‑training visibly reduces cone‑like clustering in token embeddings in real‑world deployments.
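
The following toy sketch shows the kind of effect described: gradient descent on a dispersion-style penalty spreads out a tight cluster of directions. It works on the unit circle with one angle per vector and uses mean squared cosine similarity as an assumed stand-in for the loss. The learning rate, step count, and starting angles are hypothetical, and this is not the paper's training setup.

```python
import math
from itertools import combinations

def mean_cosine(angles):
    """Mean pairwise cosine similarity of unit vectors given by angles."""
    pairs = list(combinations(angles, 2))
    return sum(math.cos(a - b) for a, b in pairs) / len(pairs)

def mean_sq_cosine(angles):
    """Assumed dispersion penalty: mean squared pairwise cosine similarity."""
    pairs = list(combinations(angles, 2))
    return sum(math.cos(a - b) ** 2 for a, b in pairs) / len(pairs)

def dispersion_step(angles, lr=0.1):
    """One gradient-descent step on mean_sq_cosine with respect to each angle."""
    n_pairs = len(angles) * (len(angles) - 1) / 2
    updated = []
    for i, a in enumerate(angles):
        # d/da cos^2(a - b) = -sin(2 * (a - b)), summed over the other angles.
        grad = sum(
            -math.sin(2 * (a - b)) for j, b in enumerate(angles) if j != i
        ) / n_pairs
        updated.append(a - lr * grad)
    return updated

# Hypothetical starting point: four directions inside a narrow cone.
angles = [0.0, 0.1, -0.1, 0.05]
initial_cos, initial_loss = mean_cosine(angles), mean_sq_cosine(angles)

for _ in range(500):
    angles = dispersion_step(angles)

final_cos, final_loss = mean_cosine(angles), mean_sq_cosine(angles)
print(f"mean cosine: {initial_cos:.3f} -> {final_cos:.3f}")
print(f"penalty:     {initial_loss:.3f} -> {final_loss:.3f}")
```

In an actual training run a term like this would presumably be added to the language-modeling loss with a weighting coefficient. The article does not state that weight, so it is left out here.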