
Six hidden tricks for building LLMs from scratch

Towards Data Science

Building a large language model from the ground up forces engineers to confront choices most tutorials skip. The author rewrote GPT-2 in pure PyTorch, then layered on LoRA adapters, Rotary Positional Embeddings (RoPE), a KV cache, and other tricks. Six hard-won insights emerged, each touching the cost, speed, or capability of the final model.

First, the classic LoRA scheme updates two low-rank matrices that amount to only 0.18% of the total weights, but as the rank r grows, the scaling factor α/r shrinks the update, making high-rank fine-tuning ineffective. Replacing r with √r in the denominator, a variant called rsLoRA, keeps the update's variance constant across ranks, preserving its magnitude. Second, positional information is moving away from learned or sinusoidal tables: RoPE rotates query and key vectors by position-dependent angles, adding zero parameters while leaving the token embeddings untouched.
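The LoRA scaling difference can be sketched in a few lines of PyTorch. This is a hypothetical minimal adapter layer, not the article's code; the class name, dimensions, and init constants are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update: W x + scale * B A x.

    Classic LoRA sets scale = alpha / r, so the update shrinks as the rank
    grows; rsLoRA uses alpha / sqrt(r), keeping the update's variance roughly
    constant across ranks. (Illustrative sketch, not the article's code.)
    """
    def __init__(self, in_features, out_features, r=8, alpha=16, rank_stabilized=True):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / (r ** 0.5) if rank_stabilized else alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the layer initially reproduces the frozen base output exactly; only the choice of `scale` changes how strongly the learned low-rank update is applied as training proceeds.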

Weight tying, which shares the token-embedding matrix with the output projection, saves roughly 30% of the parameters in a 124M model but fades to under 0.5% at billion scale, explaining why newer LLMs drop the trick. Likewise, the community migrated from Post-LayerNorm to Pre-LayerNorm in favor of training stability, despite a modest hit to final accuracy. These architectural decisions, once obscure, now dictate whether a home-grown LLM can run on a single GPU or demands a cluster.
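The weight-tying saving is easy to verify with a toy head-only model. The class below is hypothetical (only an embedding and an output projection, GPT-2-sized dimensions assumed):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy embedding + output head illustrating weight tying (illustrative only)."""
    def __init__(self, vocab_size=50257, d_model=768, tie_weights=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_weights:
            # Both shapes are (vocab_size, d_model): one matrix serves both roles.
            self.lm_head.weight = self.embed.weight

def n_params(m):
    # Module.parameters() deduplicates shared tensors by default.
    return sum(p.numel() for p in m.parameters())
```

With GPT-2's vocabulary of 50,257 and width of 768, the shared matrix holds about 38.6M parameters, which is the roughly 30% saving in a 124M model cited above; against a multi-billion-parameter body the same matrix becomes a rounding error.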
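The Post-LN versus Pre-LN difference is purely a question of where normalization sits relative to the residual connection. A minimal sketch, with illustrative dimensions and PyTorch's built-in attention standing in for a custom implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block showing Pre-LN vs Post-LN sublayer ordering (sketch)."""
    def __init__(self, d_model=64, n_heads=4, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        if self.pre_norm:
            # Pre-LN: normalize before each sublayer, leaving the residual path
            # untouched, which keeps gradients well-scaled in deep stacks.
            h = self.ln1(x)
            x = x + self.attn(h, h, h)[0]
            x = x + self.mlp(self.ln2(x))
        else:
            # Post-LN (original Transformer): normalize after the residual add.
            x = self.ln1(x + self.attn(x, x, x)[0])
            x = self.ln2(x + self.mlp(x))
        return x
```

Both orderings use identical parameters; the stability advantage of Pre-LN comes only from the clean residual path, which is why the switch cost nothing at the architecture level.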