HeadlinesBriefing.com

CFS Throttling Triggers Go Timeouts in Production

Hacker News

A Go function in a production service kept hitting its 10‑second timeout, yet development and CI runs passed without issue. The culprit lay in the container’s CPU quota: the Linux kernel enforces a 2000m limit as a budget of 200 ms of CPU time per 100‑ms CFS period, not as two dedicated cores. A single request that fans out across many cores can exhaust that budget and stall until the next period begins.

Average CPU metrics mislead operators. A pod showing 40 % utilization against a 2000m limit sounds healthy, but the average hides what happens inside each 100‑ms CFS period: the container gets 200 ms of CPU time, and that time can be spent on all of the node’s cores at once. A single burst running on eight cores, for example, consumes the full budget in 25 ms; the kernel then throttles the container for the rest of the period, leaving its goroutines starved.

Standard tooling rarely exposes this throttling. The /sys/fs/cgroup/cpu.stat file records nr_throttled and throttled_usec counters that climb when a container hits its limit. Operators can run kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat to spot the problem before latency spikes become visible in dashboards.

Long‑term fixes involve cgroup‑aware runtimes and application‑level starvation checks. Go 1.25 makes the default GOMAXPROCS cgroup‑aware, reducing oversubscription, while CockroachDB’s feedback controller trims background work when goroutine scheduling latency exceeds 1 ms. Until ecosystems adopt these safeguards, monitoring nr_throttled remains the only reliable early warning for latency‑sensitive workloads that cannot tolerate timeouts.