HeadlinesBriefing.com

etcd CrashLoopBackOff? Check Disk Latency First

Hacker News

A team debugging persistent pod crashes in a Karmada control plane discovered the culprit wasn't application code but etcd's sensitivity to slow storage. The issue surfaced while building a cloud-edge demo using VMs on shared hardware, where etcd's fsync calls timed out, causing leader election failures and pod deaths. This reveals a fundamental distributed systems constraint: etcd's consistency demands fast, dedicated I/O.

Initial troubleshooting targeted the typical Kubernetes suspects (resource limits and networking), but the logs eventually pointed to etcd timing out. The root cause was I/O latency from the ZFS storage backend hosting the VMs. etcd relies on quick write-ahead log commits: each write must be fsynced to disk before it is acknowledged, so when storage lags, etcd misses its internal deadlines and destabilizes the entire control plane. The symptom was a regular pattern of pod crashes every few minutes.
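To make the fsync dependency concrete, here is a minimal sketch of a latency probe. It is not the team's tooling or etcd's own benchmark: the function names, the 50-write sample, and the 2300-byte record size are assumptions chosen for illustration. It appends small records to a scratch file, calls `os.fsync` after each one, and reports a nearest-rank percentile.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, writes=50, payload_size=2300):
    """Append small records to a scratch file in `directory`, fsync after
    each write, and return the per-fsync latencies in seconds.

    `writes` and `payload_size` are illustrative assumptions, not values
    taken from etcd.
    """
    payload = b"\0" * payload_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # block until the kernel reports the data as durable
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for 0 < pct <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Point this at the filesystem that would hold etcd's data directory.
    samples = measure_fsync_latencies(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(samples, 99) * 1000:.2f} ms")
```

Running it against the directory that backs etcd's data gives a rough first signal. A result in the tens or hundreds of milliseconds on an otherwise idle machine would be consistent with the storage problem described above.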

The fix came from aggressive ZFS tuning on the host: disabling synchronous writes (`sync=disabled`), enabling fast LZ4 compression, turning off atime updates, and aligning recordsize to 8KB. The `sync=disabled` change alone stopped the timeouts by making fsync return almost instantly. The price is durability: writes that etcd believes are on disk can be lost if the host crashes or loses power, a trade the team accepted for demo stability. The other settings reduced overall I/O pressure.
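As a rough sketch of what that tuning looks like in practice, the snippet below builds the corresponding `zfs set` commands. The dataset name `tank/vms` and the helper names are hypothetical; the property values are the ones listed above. By default it only prints the commands (a dry run), since `sync=disabled` should not be applied to anything holding data you care about.

```python
import subprocess

# Property values as described in the article; demo use only.
ZFS_DEMO_PROPERTIES = {
    "sync": "disabled",     # fsync returns without waiting for stable storage
    "compression": "lz4",   # cheap compression to cut bytes written
    "atime": "off",         # skip access-time updates on reads
    "recordsize": "8K",     # smaller records for small random writes
}


def build_zfs_commands(dataset, properties=ZFS_DEMO_PROPERTIES):
    """Return one `zfs set` argv list per property for the given dataset."""
    return [
        ["zfs", "set", f"{name}={value}", dataset]
        for name, value in properties.items()
    ]


def apply_tuning(dataset, dry_run=True):
    """Print the commands, or run them on the host when dry_run is False."""
    for argv in build_zfs_commands(dataset):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply_tuning("tank/vms")  # hypothetical dataset name
```

Note that a changed `recordsize` only applies to blocks written after the change, so existing VM disk images keep their old layout until they are rewritten.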

The lesson is concrete: if etcd-backed systems like Karmada exhibit unexplained crash loops, diagnose disk latency first. Monitor `etcd_disk_wal_fsync_duration_seconds`; sustained 99th-percentile values above 100ms indicate a storage problem, not a configuration one (etcd's own guidance is stricter, recommending that this p99 stay under 10ms). Production demands dedicated SSDs, but for demos, ZFS tuning can rescue shared environments.
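The metric is a Prometheus histogram, so the 99th percentile has to be estimated from cumulative bucket counts. In PromQL this is typically done with `histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))`. The sketch below approximates the same linear interpolation in plain Python so the 100ms check can be reasoned about offline. The function names and the sample bucket data are made up for illustration.

```python
import math

SLOW_FSYNC_THRESHOLD_SECONDS = 0.1  # the 100ms figure from the article


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs whose
    last upper bound is +inf, mirroring Prometheus `le` buckets.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest
                # finite bound, as Prometheus does.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that holds the target.
            fraction = (target - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def fsync_looks_slow(p99_seconds, threshold=SLOW_FSYNC_THRESHOLD_SECONDS):
    """True when the estimated p99 exceeds the threshold."""
    return p99_seconds > threshold


if __name__ == "__main__":
    # Made-up sample: most fsyncs take 10ms to 1s, which is far too slow.
    sample = [(0.001, 0), (0.01, 50), (0.1, 90), (1.0, 100), (math.inf, 100)]
    p99 = histogram_quantile(0.99, sample)
    print(f"estimated p99: {p99:.3f}s, slow: {fsync_looks_slow(p99)}")
```

The estimate is only as fine as the bucket boundaries, so a single reading just past the threshold is weak evidence. The article's advice concerns values that stay high over time.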