HeadlinesBriefing.com

YAML Pipelines Replace PySpark at Mindbox

Towards Data Science

Mindbox data engineer Kiril Kazlou details how his team replaced PySpark pipelines with a YAML-based approach, cutting data pipeline delivery time from three weeks to a single day. Previously, building a single pipeline required three weeks of developer work in PySpark, creating a bottleneck for business metrics and data marts. The new approach lets analysts with no Python experience independently create and maintain data pipelines.

The solution combines three specialized tools: dlt for data ingestion via YAML configuration, dbt on Trino for SQL-based transformations, and Airflow + Cosmos for orchestration. This declarative approach splits responsibilities, allowing analysts to handle the business logic while maintaining the performance benefits of Trino for processing large datasets. The entire pipeline configuration requires just four YAML files instead of complex Python scripts.
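The summary does not reproduce the four YAML files themselves; as a hypothetical sketch of what one ingestion config in this declarative style might look like (every key, name, and connection string below is illustrative, not dlt's or Mindbox's actual schema):

```yaml
# sources.yml -- illustrative ingestion config in the declarative style
# described above; field names are hypothetical, not dlt's exact schema
sources:
  orders_postgres:
    type: sql_database
    connection: "postgresql://analytics_ro@db.internal/orders"
    tables:
      - orders
      - order_items
    write_disposition: merge   # upsert new rows on each run
    destination: trino          # transformations then happen in dbt on Trino
```

The point of the pattern is that an analyst edits files like this rather than Python code, while the ingestion tool supplies the imperative machinery underneath.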

This shift transformed their workflow from requiring developer handoffs for each metric to enabling analysts to directly implement business logic. The approach handles 90% of their pipelines efficiently with Trino's federated access to multiple data stores. By eliminating Python dependencies for routine pipeline creation, Mindbox has fundamentally changed how their organization approaches data infrastructure.
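To make the declarative idea concrete: a pipeline is described as data (what depends on what), and a generic runner derives the execution order. The sketch below is a stdlib-only illustration of that pattern, not dlt, dbt, or Cosmos code; the step names and the `depends_on` field are hypothetical, and in Mindbox's setup the actual scheduling is done by Airflow + Cosmos.

```python
# Illustrative sketch: a pipeline described declaratively as data,
# executed by a generic runner. In practice this spec would live in a
# YAML file; it is shown inline as a dict to keep the sketch
# dependency-free. All names are hypothetical.
PIPELINE_SPEC = {
    "ingest_orders":  {"depends_on": []},
    "stg_orders":     {"depends_on": ["ingest_orders"]},
    "mart_revenue":   {"depends_on": ["stg_orders"]},
    "mart_retention": {"depends_on": ["stg_orders"]},
}

def execution_order(spec):
    """Topologically sort steps so each runs after its dependencies."""
    done, order = set(), []

    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:  # guard against circular dependencies
            raise ValueError(f"dependency cycle at {name}")
        for dep in spec[name]["depends_on"]:
            visit(dep, seen + (name,))
        done.add(name)
        order.append(name)

    for name in spec:
        visit(name)
    return order

print(execution_order(PIPELINE_SPEC))
# → ['ingest_orders', 'stg_orders', 'mart_revenue', 'mart_retention']
```

An orchestrator like Airflow does essentially this dependency resolution at scale, which is why pushing the pipeline definition into declarative files removes the need for per-pipeline Python.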