HeadlinesBriefing.com

From‑Scratch GPT‑2 Clone Built in C/CUDA

Hacker News

Developer JustVugg released nanoeuler, a GPT‑2‑class language model written entirely in C and CUDA without any deep‑learning frameworks. The repo contains hand‑written forward and backward passes, a byte‑level BPE tokenizer, and a FlashAttention kernel. It is aimed at researchers who want to see how parameters, data, and GPU kernels interact at a low level, with the goal of demystifying how LLMs are built.
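To make the tokenizer idea concrete, here is a minimal sketch of a single byte‑level BPE training step in plain C: find the most frequent adjacent pair of token ids and replace it with a new id. This is not nanoeuler's code; the function names (most_frequent_pair, merge_pair, bpe_train_step) are hypothetical, and the quadratic pair count is chosen for readability rather than speed.

```c
#include <assert.h>
#include <stddef.h>

/* Find the most frequent adjacent pair in ids[0..n). Ties go to the pair
 * that appears first. O(n^2) for clarity; a real tokenizer would use a
 * hash table. Overlapping occurrences (e.g. "aaa") are each counted.
 * Returns the count and writes the pair to *a and *b. */
static size_t most_frequent_pair(const int *ids, size_t n, int *a, int *b) {
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++)
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1]) count++;
        if (count > best) { best = count; *a = ids[i]; *b = ids[i + 1]; }
    }
    return best;
}

/* Replace each non-overlapping (a, b), scanning left to right, with new_id.
 * Works in place because the write index never passes the read index.
 * Returns the new sequence length. */
static size_t merge_pair(int *ids, size_t n, int a, int b, int new_id) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;
            i += 2;
        } else {
            ids[out++] = ids[i];
            i += 1;
        }
    }
    return out;
}

/* One training step (hypothetical helper): merge the most frequent pair.
 * Ids 0..255 are reserved for raw bytes, so new ids start at 256. */
size_t bpe_train_step(int *ids, size_t n, int new_id) {
    int a = 0, b = 0;
    assert(new_id >= 256);
    if (most_frequent_pair(ids, n, &a, &b) < 2) return n; /* nothing worth merging */
    return merge_pair(ids, n, a, b, new_id);
}
```

Calling bpe_train_step on the bytes of "abababc" with new_id 256 merges the pair ('a', 'b') and shrinks the sequence from 7 ids to 4. Repeating the step with increasing ids builds up the merge table.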

The engine runs on a single consumer GPU: a ~116M‑parameter model trains on an RTX 4070 in a few days, while a tiny 0.76M‑parameter showcase model runs on CPU with OpenMP. The architecture mirrors modern transformers: RMSNorm, rotary embeddings, SwiGLU feed‑forward layers, grouped‑query attention, and multi‑token prediction. Every kernel is validated against a CPU reference and gradient‑checked to 1e‑4 relative error, and the FlashAttention kernel yields a roughly threefold speedup.

nanoeuler is positioned as a research and educational artifact rather than a production chatbot; its 116M model produces fluent English but lacks real‑world knowledge. By exposing the full training pipeline, from tokenizer creation through pretraining on a books‑plus‑web corpus to supervised fine‑tuning, the project demonstrates that end‑to‑end LLM development can be understood without opaque libraries. It compiles with gcc 13 on Linux and needs no external ML dependencies.