HeadlinesBriefing.com

Gemma 4 flag ablation reveals real speed gains on legacy Xeon

Hacker News

A developer revisited the 25‑flag command that runs Gemma 4, a 26‑billion‑parameter model, on a 2016 Xeon E5‑2620 v4 without a GPU. After the original post went viral on Hacker News, many users copied the flag list without knowing which options actually affect performance. The author then ran a systematic ablation, disabling one flag at a time to isolate each flag's contribution to latency.
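As a rough illustration of the one-flag-at-a-time ablation described above, the sketch below builds one command line per dropped flag. It is not the author's script: the binary name `server-binary` and the flags `--flag-a`, `--flag-b`, and `--flag-c` are hypothetical placeholders, not real llama‑server options.

```python
from typing import Dict, List, Optional

# Hypothetical baseline configuration: flag name -> value.
# A value of None means the flag is a bare switch with no argument.
# These names are placeholders, not real llama-server options.
BASELINE: Dict[str, Optional[str]] = {
    "--flag-a": None,
    "--flag-b": "8",
    "--flag-c": "on",
}


def build_argv(binary: str, flags: Dict[str, Optional[str]]) -> List[str]:
    """Turn a flag mapping into an argv list suitable for subprocess."""
    argv = [binary]
    for name, value in flags.items():
        argv.append(name)
        if value is not None:
            argv.append(value)
    return argv


def leave_one_out(
    binary: str, baseline: Dict[str, Optional[str]]
) -> Dict[str, List[str]]:
    """Return the baseline command plus one variant per removed flag."""
    variants = {"baseline": build_argv(binary, baseline)}
    for dropped in baseline:
        remaining = {k: v for k, v in baseline.items() if k != dropped}
        variants[f"without {dropped}"] = build_argv(binary, remaining)
    return variants


variants = leave_one_out("server-binary", BASELINE)
for label, argv in variants.items():
    print(f"{label}: {' '.join(argv)}")
```

Each variant would then be launched as a fresh process and timed against the baseline, so that any change in throughput can be attributed to the single flag that was removed.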

The experiment required 174 fresh server launches, each reloading 25 GB of weights from a spinning disk before generating any tokens. Three prompts (short chat, 5k‑token summarization, and code generation) were run under llama‑server to capture speculative‑decoding telemetry. Results showed that flash attention, core thread count, and the drafter configuration were the only settings that measurably changed throughput.

Turning off the drafter entirely improved chat speed only marginally but boosted code generation by 28% and sped up long‑document summarization by 54%. Fixed draft lengths outperformed the autotune setting, which was the worst speculation mode across workloads. The study suggests that most flags add noise and that a leaner configuration can double token output on this legacy hardware, a finding relevant to production workloads on similar servers.
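For readers who want to compute per-workload gains like these from their own runs, here is a minimal sketch of the arithmetic. The tokens-per-second values are invented placeholders chosen only to illustrate the calculation; the article does not report the raw throughput figures.

```python
from typing import Dict, Tuple


def percent_change(baseline_tps: float, variant_tps: float) -> float:
    """Relative throughput change of a variant versus the baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (variant_tps - baseline_tps) / baseline_tps * 100.0


# Placeholder measurements (NOT from the article):
# workload -> (baseline tokens/s, variant tokens/s)
measurements: Dict[str, Tuple[float, float]] = {
    "chat": (4.0, 4.1),
    "code": (4.0, 5.12),
    "summarize": (2.0, 3.08),
}

report = {
    workload: round(percent_change(base, variant), 1)
    for workload, (base, variant) in measurements.items()
}

for workload, change in report.items():
    print(f"{workload}: {change:+.1f}%")
```

Reporting the change per workload rather than as a single average matters here, because the same setting can be nearly neutral for chat while making a large difference for long-context summarization.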