HeadlinesBriefing.com

SQL on Kafka: Simplifying Data Queries

DEV Community

The tech world has long relied on stream processing engines like Flink, ksqlDB, and Kafka Streams to handle continuous computations over unbounded data. These tools, while powerful, carry significant operational costs, as Confluent's own documentation acknowledges. Teams often need dedicated experts to deploy and maintain these systems, which demand ongoing performance tuning and can fail in hard-to-debug ways, such as checkpoint failures.

Engineers frequently ask straightforward questions about Kafka data, such as 'What is in this topic right now?' or 'Where is the message with this key?' These queries are not streaming problems but bounded lookups over historical data. They do not require windows, watermarks, checkpoints, or state recovery. This realization suggests that a simpler architecture could suffice for many use cases.
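To make the distinction concrete, here is a minimal, broker-free sketch of what such a query actually is. The names (`Record`, `find_by_key`) are hypothetical, and a real consumer would read from Kafka rather than a Python list; the point is that once you snapshot the partition's end offset, the scan is bounded and terminates on its own, with no windows, watermarks, or checkpoints involved.

```python
# Hypothetical model of a bounded key lookup against one Kafka partition.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    offset: int
    key: str
    value: str

def find_by_key(partition_log, key, end_offset):
    """Bounded lookup: scan only records below the snapshotted end offset.

    Because end_offset is fixed up front, this is a finite scan over
    historical data, not a streaming computation.
    """
    return [r for r in partition_log if r.offset < end_offset and r.key == key]

log = [Record(0, "user-1", "a"), Record(1, "user-2", "b"), Record(2, "user-1", "c")]
matches = find_by_key(log, "user-1", end_offset=len(log))
# matches holds the records at offsets 0 and 2
```

A streaming engine answers the same question by materializing state and keeping it current forever; the bounded version simply stops when it reaches the snapshot.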

Kafka data is inherently structured for this kind of querying. The broker appends records to log segments, which are immutable once closed. Each partition is an ordered sequence of records, and Kafka maintains sparse indexes that allow efficient seeking by offset and timestamp. With tiered storage, segments can live in object storage such as S3, leaving Kafka data organized much like a SQL-on-files dataset: immutable files plus indexes over them.
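The sparse-index idea can be sketched in a few lines. This is a simplified, hypothetical model (in-memory tuples standing in for segment files), not Kafka's actual index format: index every Nth record, binary-search the index to find a starting point, then scan forward to the first record at or after the target timestamp.

```python
# Sketch of timestamp-based seeking via a sparse index (hypothetical layout).
import bisect

# A "segment": (timestamp, offset) pairs, ordered by both fields.
records = [(1000 + 10 * i, i) for i in range(10)]
# Sparse index: only every 2nd record is indexed.
sparse_index = records[::2]

def seek_by_timestamp(ts):
    """Return the offset of the first record with timestamp >= ts, else None."""
    # Binary-search the sparse index for the last entry at or before ts...
    i = bisect.bisect_right(sparse_index, (ts, float("inf"))) - 1
    start = sparse_index[max(i, 0)][1]
    # ...then scan forward linearly from that offset.
    for timestamp, offset in records[start:]:
        if timestamp >= ts:
            return offset
    return None

seek_by_timestamp(1035)  # lands on offset 4, the record with timestamp 1040
```

The index stays small because it skips records, and the cost of a seek is one binary search plus a short bounded scan, which is what makes these point lookups cheap.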

For common queries like these, the operational overhead of a streaming engine is often unnecessary. Streaming engines pay that overhead for capabilities such as distributed state backends and coordinated checkpoints, which are overkill for simple lookups. Riskified, for instance, migrated from ksqlDB to Flink because of ksqlDB's limitations around evolving schemas and its operational complexity. And most Kafka clusters run at or below 1 MB/s, suggesting that small-data workloads are the norm, yet teams still bear the cost of big-data infrastructure.