
Databricks Rate Limit Overhaul

ByteByteGo

When Databricks launched real-time model serving, its simple rate limiting architecture couldn't handle the increased load. The system, built from Envoy, a Ratelimit Service, and a single Redis instance, suffered three critical problems: high tail latency, diminishing returns from scaling, and a single point of failure. The initial design couldn't keep up with AI workloads that brought orders of magnitude more traffic.
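As a rough sketch of that original pattern (illustrative code, not Databricks' actual implementation): a fixed-window counter in Redis puts a network round trip on every single request, which is exactly where the tail latency and the single point of failure come from.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow checks a fixed-window counter stored in Redis. Every request
// pays a round trip to the one Redis instance: the critical-path hop
// and single point of failure described above.
func allow(ctx context.Context, rdb *redis.Client, key string, limit int64) (bool, error) {
	// Window the key by the current second (illustrative granularity).
	windowKey := fmt.Sprintf("rl:%s:%d", key, time.Now().Unix())

	count, err := rdb.Incr(ctx, windowKey).Result()
	if err != nil {
		return false, err // Redis unavailable => rate limiting unavailable
	}
	if count == 1 {
		rdb.Expire(ctx, windowKey, 2*time.Second) // let stale windows expire
	}
	return count <= limit, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := allow(context.Background(), rdb, "user-42", 100)
	fmt.Println(ok, err)
}
```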

The team redesigned the system around Dicer, a routing layer that keeps counters in memory rather than in Redis. This eliminated network hops on the critical path and dramatically reduced latency. A horizontally scalable partitioning scheme removed the single point of failure: each replica became the authoritative store for its own slice of keys.
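A minimal sketch of the partitioning idea, with invented names (shard, router) since the summary doesn't describe Dicer's internals: hash each key to one replica and keep that key's counter only in that replica's memory, so no external store sits on the request path.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shard owns the authoritative in-memory counters for its slice of keys.
type shard struct {
	mu     sync.Mutex
	counts map[string]int64
}

func (s *shard) incrAndCheck(key string, limit int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[key]++
	return s.counts[key] <= limit
}

// router sends each key to the one replica that owns it; counters live
// in that replica's memory, with no Redis hop on the critical path.
type router struct {
	shards []*shard
}

func (r *router) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.shards[h.Sum32()%uint32(len(r.shards))]
}

func main() {
	r := &router{}
	for i := 0; i < 4; i++ {
		r.shards = append(r.shards, &shard{counts: make(map[string]int64)})
	}
	fmt.Println(r.shardFor("user-42").incrAndCheck("user-42", 100))
}
```

Adding replicas adds capacity because each one only handles its own slice of the keyspace, which is what removes the scaling ceiling of the single Redis instance.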

The most impactful change was batch reporting: clients make no remote calls on the rate limit path. Instead, they rate-limit optimistically against a local budget and report their counts every 100 milliseconds. This cut tail latency by roughly a factor of ten, turned spiky traffic into a constant stream of reports, and made server-side load predictable for the first time.
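A sketch of what the client side of optimistic batch reporting might look like; the report callback and the budget hand-back are assumptions, since the summary only says that clients decide locally and report counts every 100 milliseconds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batchReporter does purely local admission: Allow never makes a
// remote call. Accumulated counts are flushed on a fixed 100ms tick,
// so the server sees a steady stream of small reports instead of one
// RPC per request.
type batchReporter struct {
	mu      sync.Mutex
	budget  int64            // local share of the global limit for this interval
	used    int64
	pending map[string]int64 // counts not yet reported to the service

	report func(map[string]int64) int64 // hypothetical RPC; returns the next budget
}

func (b *batchReporter) Allow(key string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used >= b.budget {
		return false // optimistic local decision; may drift slightly from the global count
	}
	b.used++
	b.pending[key]++
	return true
}

func (b *batchReporter) run() {
	for range time.Tick(100 * time.Millisecond) {
		b.mu.Lock()
		batch := b.pending
		b.pending = make(map[string]int64)
		b.used = 0
		b.mu.Unlock()

		next := b.report(batch) // server reconciles counts, returns a fresh quota
		b.mu.Lock()
		b.budget = next
		b.mu.Unlock()
	}
}

func main() {
	b := &batchReporter{
		budget:  10,
		pending: make(map[string]int64),
		report: func(batch map[string]int64) int64 {
			fmt.Println("reporting:", batch) // stand-in for the real RPC
			return 10
		},
	}
	go b.run()
	fmt.Println(b.Allow("user-42"))
	time.Sleep(150 * time.Millisecond)
}
```

The trade-off is a small window of over-admission between reports, accepted in exchange for removing the network entirely from the per-request path.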