HeadlinesBriefing.com

Legacy Xeon Runs 26B Gemma Model Without GPU

Hacker News

A single 2016 Intel Xeon E5-2620 v4, paired with 128 GB of DDR3, can run Gemma 4's 26B-A4B model with its MTP drafters and no GPU. The author recycles a server that feels like a relic, yet the machine serves as a proof of concept that CPU-only inference is viable when tuned carefully. The result challenges the notion that modern LLMs demand high-end GPUs. The setup is reported to run at roughly 0.3 tokens per second.

To make the run practical, the author drives llama-cli with a long list of tuning options, chief among them speculative decoding, CPU-MoE routing, and runtime repacking. According to the write-up, these options keep more weight traffic in the Xeon's L3 cache and reduce memory-bandwidth stalls, which are the main bottleneck for a memory-bound decoder. Because none of the flags depend on GPU-specific optimizations, the setup should carry over to other legacy hardware. All of the switches can be set on a single command line, which simplifies deployment.
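To see why speculative decoding helps a memory-bound decoder, it can be useful to estimate how many tokens each expensive pass of the large model yields. The following is a minimal sketch, not the author's method: it assumes each drafted token is accepted independently with the same probability, which is a simplification, and the function name and example values are hypothetical.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the large model.

    Assumption (simplification): each drafted token is accepted
    independently with probability `accept_rate`. The pass always emits
    at least one token, because the large model supplies the token that
    follows the last accepted draft token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft token is accepted, plus one token from the large model.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the article.
    for rate in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_pass(rate, draft_len=4)
        print(f"accept_rate={rate:.1f} -> {tokens:.2f} tokens per pass")
```

Under this model, a higher acceptance rate means each full read of the large model's active weights is spread over more output tokens, which is where the saving in memory traffic would come from.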

Running Gemma 4 on a DDR3 server suggests that, with the right software tuning, CPUs can serve as viable inference backends for demanding language models. The experiment underscores the importance of memory-centric optimizations and shows that legacy hardware can still run modern AI workloads. The tuning targets memory-bandwidth stalls rather than memory footprint; the test machine has 128 GB of RAM.
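The memory-centric framing can be made concrete with a back-of-the-envelope ceiling on decode speed: if every generated token requires streaming the active weights from RAM once, throughput cannot exceed bandwidth divided by bytes read per token. The sketch below is a rough estimate under that assumption. The function name, the quantization width, and the bandwidth figure are all hypothetical placeholders rather than values from the article, and it ignores compute limits, KV-cache traffic, and cache reuse, so real throughput can fall well below this ceiling.

```python
def decode_tokens_per_second_bound(
    active_params: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    Assumes each token streams all active weights from RAM exactly once.
    """
    if active_params <= 0 or bits_per_weight <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical inputs: ~4e9 active parameters (suggested by the "A4B"
    # in the model name), 4-bit weights, and 40 GB/s of sustained bandwidth.
    bound = decode_tokens_per_second_bound(4e9, 4.0, 40.0)
    print(f"bandwidth ceiling: {bound:.1f} tokens/s")
```

Only the active parameters enter the estimate, which is why a mixture-of-experts model with a small active set is a more plausible fit for a bandwidth-limited CPU than a dense model of the same total size.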