HeadlinesBriefing.com

Airbnb's OpenTelemetry Migration Slashes Metrics Processing Overhead

Hacker News

Adopting OpenTelemetry cut the share of CPU time spent on metrics processing from 10% to under 1% in Airbnb's production systems. Moving from StatsD to the OpenTelemetry Protocol (OTLP) eliminated intermediate translation steps and improved reliability under high throughput. High-cardinality metrics, however, caused memory problems, which Airbnb resolved by switching select services to delta temporality. Because a delta exporter keeps no running total, an interval whose export fails can be lost, so the change trades occasional data gaps during failures for stable resource usage.
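The difference between the two temporalities can be illustrated with a toy counter store. This is a minimal sketch, not the OpenTelemetry SDK or Airbnb's metrics library: `CounterStore`, `failing_send`, and the `request_id` label are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class CounterStore:
    """Toy in-process counter store (hypothetical; not the OpenTelemetry SDK)."""

    def __init__(self, temporality):
        assert temporality in ("cumulative", "delta")
        self.temporality = temporality
        self.series = defaultdict(int)  # label tuple -> running value

    def add(self, labels, value=1):
        self.series[labels] += value

    def export(self, send):
        """Hand a snapshot to `send`; return True if `send` accepted it."""
        snapshot = dict(self.series)
        if self.temporality == "delta":
            # Delta: state is dropped at every export, so memory is bounded by
            # the series seen in one interval, but a failed send loses them.
            self.series.clear()
        try:
            send(snapshot)
            return True
        except ConnectionError:
            return False


def failing_send(snapshot):
    raise ConnectionError("collector unavailable")


def run(temporality):
    store = CounterStore(temporality)
    received = []
    # Interval 1: 1000 distinct label sets (high cardinality); export fails.
    for i in range(1000):
        store.add(("request_id", i))
    store.export(failing_send)
    retained_after_failure = len(store.series)
    # Interval 2: a single series is touched; export succeeds.
    store.add(("request_id", 0))
    store.export(received.append)
    return retained_after_failure, received[-1]


cumulative_retained, cumulative_payload = run("cumulative")
delta_retained, delta_payload = run("delta")
print(cumulative_retained, len(cumulative_payload))  # 1000 1000
print(delta_retained, len(delta_payload))            # 0 1
```

In the cumulative run, all 1000 series stay in memory and the next export still carries the full totals. In the delta run, memory is released at each export, but the increments from the failed interval never reach the backend.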

The migration exposed trade-offs in metric collection. OpenTelemetry enabled Prometheus-native features such as exponential histograms, but high-volume services saw performance regressions. Airbnb responded by changing its common metrics library to use delta temporality, which reduces the in-process memory burden. The change was applied only to services emitting more than 10K samples per second; all others kept cumulative mode. Dual-writing during the migration allowed a gradual transition without disrupting legacy systems.
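The selective rollout amounts to a per-service threshold rule. The sketch below assumes the decision is based on an observed sample rate; the function name, service names, and rates are hypothetical, and only the 10K samples per second figure comes from the article.

```python
DELTA_THRESHOLD_SAMPLES_PER_SEC = 10_000  # figure quoted in the article


def choose_temporality(samples_per_second):
    """Return 'delta' for high-volume services, 'cumulative' otherwise."""
    if samples_per_second > DELTA_THRESHOLD_SAMPLES_PER_SEC:
        return "delta"
    return "cumulative"


# Hypothetical service names and rates, for illustration only.
observed_rates = {"search": 250_000, "payments": 4_000, "listing-sync": 10_000}
plan = {name: choose_temporality(rate) for name, rate in observed_rates.items()}
print(plan)
```

A service sitting exactly at the threshold stays cumulative here, since the article says "over" 10K; how the real library treats the boundary is not stated.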

To scale the pipeline, Airbnb deployed vmagent from VictoriaMetrics, which aggregates away instance-level labels to reduce data volume before Prometheus ingestion. The architecture scales horizontally with stateless routers and stateful aggregators, and processes over 100 million samples per second. Customizations included native histogram support and multitenancy. The centralized pipeline cut costs by an order of magnitude and also became a single place to detect instrumentation errors. Because OpenTelemetry is an industry standard, the move positions Airbnb to meet future observability needs without protocol constraints.
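The router and aggregator split can be sketched in a few lines. This is an illustration of the general pattern under stated assumptions, not vmagent's implementation or configuration: the label names in `INSTANCE_LABELS`, the CRC32 hash, the shard count, and the sample data are all hypothetical.

```python
import zlib
from collections import defaultdict

INSTANCE_LABELS = {"instance", "pod"}  # assumed names of per-instance labels


def output_key(name, labels):
    """Series identity after instance-level labels are dropped."""
    kept = tuple(sorted((k, v) for k, v in labels.items() if k not in INSTANCE_LABELS))
    return (name, kept)


def route(key, num_aggregators):
    """Stateless router: a deterministic hash picks the aggregator shard."""
    return zlib.crc32(repr(key).encode()) % num_aggregators


class Aggregator:
    """Stateful shard: sums delta samples per output series until flushed."""

    def __init__(self):
        self.totals = defaultdict(float)

    def ingest(self, key, value):
        self.totals[key] += value

    def flush(self):
        out, self.totals = dict(self.totals), defaultdict(float)
        return out


aggregators = [Aggregator() for _ in range(4)]
samples = [
    ("http_requests", {"service": "search", "instance": "host-1"}, 5.0),
    ("http_requests", {"service": "search", "instance": "host-2"}, 7.0),
    ("http_requests", {"service": "payments", "instance": "host-3"}, 2.0),
]
for name, labels, value in samples:
    key = output_key(name, labels)
    aggregators[route(key, len(aggregators))].ingest(key, value)

flushed = {}
for shard in aggregators:
    flushed.update(shard.flush())
print(len(samples), "->", len(flushed))  # 3 -> 2
```

Hashing on the key with instance labels already removed sends every instance of a series to the same shard, so routers need no shared state and each aggregator can sum its series independently.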