
Hash Layers and Staircase Attention Split Model Size from Compute

When researchers ask whether a model’s power comes from sheer parameter count or raw compute, the answer has been blurry. Traditional thinking ties FLOPs to the number of weights, because each parameter participates once per token. Two new papers break that link, showing you can grow size without extra work or boost compute while keeping parameters fixed.

The first contribution, called hash layers, replaces learned routing in mixture‑of‑experts with a deterministic hash of input tokens. Each token maps to a fixed expert, eliminating the cost of learning a router during training. Experiments on the pushshift.io Reddit benchmark report that a 1.28 billion‑parameter model activates only 17% of its weights per example and outperforms the Switch MoE baseline. Scaling to 4.5 billion parameters yields faster updates than competing sparse models.
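
To make the routing idea concrete, here is a minimal sketch of hash‑based expert selection in PyTorch. The class name HashMoELayer, the dimensions, and the random‑permutation lookup table standing in for the hash function are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HashMoELayer(nn.Module):
    """Feed-forward mixture-of-experts with fixed, hash-based routing (sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, vocab_size: int):
        super().__init__()
        self.num_experts = num_experts
        # One feed-forward expert per bucket: parameters grow with num_experts,
        # but each token only ever touches a single expert's weights.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Deterministic token-id -> expert table, fixed before training
        # (a stand-in for hashing the token id into one of num_experts buckets).
        self.register_buffer("route", torch.randperm(vocab_size) % num_experts)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = self.route[token_ids]   # no learned router, no load-balancing loss
        out = torch.zeros_like(hidden)
        for e in range(self.num_experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = self.experts[e](hidden[mask])
        return out
```

Because the token-to-expert mapping is frozen, adding experts enlarges the model without changing the per-token compute, which is exactly the size-versus-FLOPs split the paper exploits.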

The second paper introduces staircase attention, a family where the same Transformer block repeats across time or depth, multiplying FLOPs without adding parameters. Ladder variants stack identical layers, while the Staircase version shifts each block forward, creating a recurrent state. On language‑modeling and dialogue tasks both variants beat standard Transformers, confirming that compute per parameter is a fertile optimization axis.
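
As a rough illustration of the parameter-sharing side, the sketch below reuses one PyTorch TransformerEncoderLayer several times, in the spirit of the Ladder variant's stacked identical layers; SharedBlockEncoder, the sizes, and the repeat count are hypothetical, and the staircase recurrence over time is not modeled here.

```python
import torch
import torch.nn as nn

class SharedBlockEncoder(nn.Module):
    """Applies one Transformer block `repeats` times: FLOPs scale with
    `repeats`, while the parameter count stays that of a single block."""

    def __init__(self, d_model: int = 512, nhead: int = 8, repeats: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.repeats = repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):
            x = self.block(x)  # same weights reused; each pass adds compute, not parameters
        return x

model = SharedBlockEncoder()
x = torch.randn(2, 16, 512)
print(model(x).shape)                                  # torch.Size([2, 16, 512])
print(sum(p.numel() for p in model.parameters()))      # one block's worth of weights
```

Doubling `repeats` doubles the FLOPs spent per token while the parameter count is unchanged, the inverse of the hash-layers trick and the other half of the size/compute decoupling the two papers argue for.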