HeadlinesBriefing.com

Run Three LLMs on a Single 8GB GPU with a C++ Daemon

Towards Data Science

An engineer with a single NVIDIA GTX 1080 and three lightweight LLMs (Llama 3.2 1B, Qwen 2 0.5B, and Smol LM 360M) hits a wall. Launching each agent in its own llama‑completion process fills the 8 GB of VRAM with pre‑reserved key‑value buffers, so the second and third jobs run out of memory and crash. On a single machine, the crash repeats every time the GPU runs out of memory.

The root cause lies in llama.cpp's design: when a context is created, it reserves the full KV cache upfront to guarantee smooth decoding. With a 172,032‑token window, the first model consumes over 6 GB before producing any output, pushing the card past 80 % memory usage. Every subsequent agent's allocation attempt then becomes a coin flip, with roughly a 50‑50 chance of success.
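As a rough sanity check on that figure, a fully reserved KV cache can be estimated as 2 (keys and values) × layers × context tokens × KV heads × head dimension × bytes per element. The sketch below applies that formula; the helper name `kv_cache_bytes` is hypothetical, and the model shape (16 layers, 8 KV heads, head dimension 64, 16‑bit cache entries) is an assumption about Llama 3.2 1B that the article does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed when the whole KV cache is reserved upfront.
// The leading factor of 2 covers storing both keys and values in every layer.
inline std::uint64_t kv_cache_bytes(std::uint64_t n_layers,
                                    std::uint64_t n_ctx,
                                    std::uint64_t n_kv_heads,
                                    std::uint64_t head_dim,
                                    std::uint64_t bytes_per_elem) {
    assert(bytes_per_elem > 0 && "cache element size must be nonzero");
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}

// Assumed shape for Llama 3.2 1B with a 16-bit cache (not taken from the article).
inline std::uint64_t llama_1b_kv_bytes_at_172032_tokens() {
    return kv_cache_bytes(/*n_layers=*/16, /*n_ctx=*/172032,
                          /*n_kv_heads=*/8, /*head_dim=*/64,
                          /*bytes_per_elem=*/2);
}
```

Under these assumptions the cache alone comes to 5,637,144,576 bytes, about 5.6 GB, which is in line with the "over 6 GB" figure once model weights and compute buffers are added on top.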

To avoid this race, the author ships lmxd, a C++ daemon that owns the GPU and tracks memory usage with a simple ledger. Agents register through a Unix socket; the daemon checks whether allocated_bytes plus the new model's estimate stays below a 90 % cap before loading the weights. If not, it returns a structured denial.
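The admission check described above can be sketched as a small ledger class. This is a minimal illustration, not lmxd's actual code: the names `VramLedger`, `Admission`, `try_reserve`, and `release` are hypothetical, the denial fields are an assumed shape for the "structured denial", and the socket handling is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical reply to an agent's registration request.
// On denial, the extra fields tell the agent how much room was left.
struct Admission {
    bool granted;
    std::uint64_t requested_bytes;
    std::uint64_t available_bytes;  // headroom under the cap before this request
};

// Minimal sketch of a ledger that tracks reserved VRAM against a 90% cap.
class VramLedger {
public:
    explicit VramLedger(std::uint64_t total_bytes)
        : cap_bytes_(total_bytes / 10 * 9) {}

    // Grants the request only if allocated + estimate stays strictly below the cap.
    // Written as a subtraction so the comparison cannot overflow.
    Admission try_reserve(std::uint64_t estimate) {
        std::lock_guard<std::mutex> lock(mu_);
        const std::uint64_t available = cap_bytes_ - allocated_bytes_;
        if (estimate >= available) {
            return {false, estimate, available};
        }
        allocated_bytes_ += estimate;
        assert(allocated_bytes_ < cap_bytes_);  // ledger invariant
        return {true, estimate, available};
    }

    // Returns a reservation to the pool when a model is unloaded.
    void release(std::uint64_t estimate) {
        std::lock_guard<std::mutex> lock(mu_);
        allocated_bytes_ -= std::min(estimate, allocated_bytes_);
    }

private:
    std::mutex mu_;
    std::uint64_t cap_bytes_;
    std::uint64_t allocated_bytes_ = 0;
};
```

With an 8 GB card the cap is 7.2 GB, so a 6 GB reservation is granted and a following 2 GB request is denied with 1.2 GB reported as available. The point of the design is that the decision is made against the ledger before any weights are loaded, so a denied agent gets an answer instead of a crash.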

By enforcing pre‑allocation checks, lmxd lets all three LLMs coexist on a single 8 GB card, eliminating the out‑of‑memory lottery. Developers can now run code‑generation, security review, and documentation agents side‑by‑side without upgrading hardware, making parallel inference practical for budget‑constrained teams. This approach also reduces GPU utilization spikes, leading to more predictable latency across agents.