HeadlinesBriefing.com

Expanse boosts GPU cluster efficiency by cutting waste

Hacker News

Four former quant‑fund engineers have launched Expanse, a service that plugs into SLURM or Kubernetes and predicts a job's GPU and CPU needs before it runs. By analysing source code, submission scripts and live telemetry, it supplies resource recommendations, failure alerts and line‑level optimisation tips. It requires no changes to existing scripts, so current workflows stay intact.

Measurements on a national‑scale HPC system showed that 59% of compute was wasted, equivalent to roughly $8.5M at cloud rates for a single month. The founders' earlier model, built at EPCC, beat baselines by 34% and outperformed leading LLMs eightfold on the same prediction task. Continuous fine‑tuning lets accuracy improve as more jobs run. Clients see shorter queues because jobs finish on right‑sized hardware.
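
To make the waste figure concrete, here is a minimal sketch of how such a percentage and its cloud‑rate cost could be computed from job accounting records. The `Job` fields, the sample jobs and the hourly rate are illustrative assumptions, not Expanse's data or method.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical accounting record; field names are assumptions.
    requested_gpu_hours: float  # GPUs allocated x wall-clock hours
    used_gpu_hours: float       # allocation x mean utilisation


def waste_fraction(jobs):
    """Share of allocated GPU-hours that did no useful work."""
    requested = sum(j.requested_gpu_hours for j in jobs)
    used = sum(j.used_gpu_hours for j in jobs)
    if requested == 0:
        return 0.0
    return (requested - used) / requested


def wasted_cost(jobs, rate_per_gpu_hour):
    """Price the unused GPU-hours at an assumed cloud rate."""
    wasted = sum(j.requested_gpu_hours - j.used_gpu_hours for j in jobs)
    return wasted * rate_per_gpu_hour


# Illustrative numbers only.
jobs = [Job(100.0, 40.0), Job(50.0, 25.0), Job(50.0, 15.0)]
print(f"waste: {waste_fraction(jobs):.0%}")
print(f"cost:  ${wasted_cost(jobs, rate_per_gpu_hour=2.0):,.2f}")
```

Weighting by GPU‑hours rather than averaging per‑job percentages keeps large allocations from being under‑counted.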

Expanse offers three user‑facing features: predictive resource sizing with confidence intervals, a live telemetry dashboard that adds only single‑digit overhead, and automated failure diagnosis that returns concise, code‑level fixes. The startup now runs paid pilots, charging per cluster after a two‑week measurement phase. It also plugs into CI pipelines for automated planning.
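
The source does not describe how Expanse computes its sizing recommendations. As a rough illustration of the idea, the sketch below uses a simple empirical percentile band over past runs of a similar job in place of a model‑based confidence interval. The function names, the 5th/95th percentile bounds and the 10% headroom are all assumptions.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, p in (0, 100]."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def recommend_memory_gb(past_peaks_gb, lower=5, upper=95, headroom=1.1):
    """Suggest a memory request from peak usage of earlier, similar jobs.

    Returns the empirical (lower, upper) band and a request equal to the
    upper bound plus headroom, rounded up to a whole gigabyte.
    """
    values = sorted(past_peaks_gb)
    if not values:
        raise ValueError("need at least one past observation")
    low = percentile(values, lower)
    high = percentile(values, upper)
    return {
        "interval_gb": (low, high),
        "request_gb": math.ceil(high * headroom),
    }


# Illustrative peak-memory history (GB) for one recurring job.
history = [10, 12, 11, 14, 13, 15, 12, 11, 13, 16]
print(recommend_memory_gb(history))
```

A wide band signals that the history is too noisy to size the job tightly, which is the kind of uncertainty a confidence interval is meant to surface.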