HeadlinesBriefing.com

Sparse File LRU Cache Cuts S3 Costs for Amplitude

Hacker News: Front Page

Amplitude engineers found that sparse files could cut storage costs on local NVMe SSDs while keeping the full analytics dataset in Amazon S3. By allocating physical disk space only when a column is actually accessed, the system avoids wasting expensive SSD capacity on rarely used data.
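The core trick is that a file's logical size and its physical allocation can differ: a sketch of the idea (the path and sizes below are illustrative, not Amplitude's actual layout) is to truncate a cache file to the full dataset size, then write only the bytes for columns that are fetched. On filesystems that support sparse files, the untouched regions remain holes and consume no disk blocks.

```python
import os

PATH = "/tmp/sparse_cache.bin"   # hypothetical cache file path
LOGICAL_SIZE = 1 << 30           # 1 GiB logical size, far more than we write

# Create a file whose logical size is 1 GiB but which occupies almost no disk:
# truncate() extends the file with a hole, not with real blocks.
with open(PATH, "wb") as f:
    f.truncate(LOGICAL_SIZE)

# Simulate fetching one column from S3 and caching it at its logical offset.
column_offset, column_bytes = 256 * 1024 * 1024, b"x" * 4096
with open(PATH, "r+b") as f:
    f.seek(column_offset)
    f.write(column_bytes)        # only this small region gets physical blocks

st = os.stat(PATH)
print("logical size:", st.st_size)            # full 1 GiB
print("physical size:", st.st_blocks * 512)   # a few KiB, not 1 GiB
os.remove(PATH)
```

Because reads and writes use the column's real offset within the file, the cached file stays byte-compatible with the original object in S3; only the allocation is lazy.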

Typical caching approaches either pull entire files from S3 or split columns into separate files. The former wastes space; the latter inflates file-system metadata and handles small columns poorly. Sparse files sit between these extremes, keeping on local disk only the logical blocks that contain requested columns.
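Keeping only the requested columns' blocks requires mapping a column's byte range to the logical blocks it spans. A minimal sketch, assuming a hypothetical column layout of (offset, length) pairs and a 4 KiB block size (both illustrative, not from the article):

```python
# Hypothetical layout: byte (offset, length) of each column within one file.
COLUMNS = {
    "user_id":   (0,     4096),
    "event":     (4096,  8192),
    "timestamp": (12288, 4096),
}
BLOCK = 4096  # assumed cache block size


def blocks_for(column: str) -> list[int]:
    """Return the logical block indexes a column's bytes fall into."""
    offset, length = COLUMNS[column]
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    return list(range(first, last + 1))


print(blocks_for("event"))  # -> [1, 2]
```

Only those block ranges need to be fetched (e.g. via ranged GETs) and written into the sparse cache file; everything else stays a hole.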

To track which blocks are cached, Amplitude stores metadata in a local RocksDB instance. The database records which logical blocks are present and their last-access timestamps, enabling an LRU eviction policy. Variable-sized blocks align with the columnar format's headers, reducing fragmentation and keeping I/O overhead low when serving reads.
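The metadata pattern can be sketched with a plain dictionary standing in for RocksDB (the class name, schema, and capacity logic below are assumptions for illustration; a logical counter stands in for wall-clock timestamps to keep the example deterministic):

```python
import itertools


class BlockMetadata:
    """In-memory stand-in for the RocksDB metadata store: it maps each
    cached block id to its last-access time and evicts the coldest block
    when the cache exceeds its block budget."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.last_access = {}          # block_id -> logical access time
        self._clock = itertools.count()

    def touch(self, block_id: str) -> None:
        """Record an access; evict the LRU block if over capacity."""
        self.last_access[block_id] = next(self._clock)
        if len(self.last_access) > self.capacity:
            self.evict()

    def evict(self) -> str:
        # Drop the least-recently-used block. A real system would also
        # deallocate its bytes in the sparse file (e.g. hole punching)
        # so the SSD space is actually reclaimed.
        victim = min(self.last_access, key=self.last_access.get)
        del self.last_access[victim]
        return victim


meta = BlockMetadata(capacity_blocks=2)
meta.touch("a")
meta.touch("b")
meta.touch("a")   # "a" is now warmer than "b"
meta.touch("c")   # over capacity: "b" is evicted
print(sorted(meta.last_access))  # -> ['a', 'c']
```

In the real system the presence-and-timestamp records live in RocksDB so they survive restarts; the eviction decision itself is the same least-recently-used scan.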

The sparse file LRU cache cuts S3 GET requests, shrinks file-system metadata, and lowers the IOPS needed for cache maintenance. A low-level file-system technique delivering system-wide gains like this is rare, underscoring how storage-level optimizations can reshape analytics pipelines for large data workloads.