HeadlinesBriefing.com

Distributed SQL for Ultra-Wide Tables

Hacker News: Front Page
A Hacker News user described hitting practical limits with ML feature engineering and multi-omics data. Standard SQL databases cap out around 1,000 columns, while columnar formats like Parquet need Spark or Python pipelines. OLAP engines assume narrow schemas, and feature stores explode data into joins. At extreme width, metadata handling and query planning become bottlenecks.
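Those caps are easy to hit in practice: PostgreSQL tops out around 1,600 columns per table, and SQLite rejects anything past its default 2,000-column limit (`SQLITE_MAX_COLUMN`). A quick check against SQLite, which ships with Python's standard library, shows the failure mode directly (the 5,000-column table here is illustrative, not from the original post):

```python
import sqlite3

# Build a CREATE TABLE statement with 5,000 columns -- well past
# SQLite's default SQLITE_MAX_COLUMN limit of 2,000.
cols = ", ".join(f"f{i} REAL" for i in range(5000))

conn = sqlite3.connect(":memory:")
error = None
try:
    conn.execute(f"CREATE TABLE features ({cols})")
except sqlite3.OperationalError as exc:
    # SQLite reports something like: "too many columns on features"
    error = exc

print(error)
```

Columnar files sidestep this limit, but as the post notes, querying them then means standing up Spark or a Python pipeline rather than issuing plain SQL.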

The user experimented with a different approach that ditches joins and transactions. Instead, the system distributes columns rather than rows, making SELECT the primary operation. This design enables native SQL selects on tables with hundreds of thousands to millions of columns. Predictable sub-second latency appears possible when accessing a subset of columns on modest hardware.
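The design described above can be sketched in a few lines: each column lives as an independent keyed array on whichever server its name hashes to, and a SELECT fetches only the requested column arrays and zips them into rows. This is a toy single-process sketch of the idea, not the author's actual system; all names (`ColumnStore`, `shard_for`, and so on) are invented for illustration.

```python
import hashlib
from typing import Sequence

class ColumnStore:
    """Toy column-distributed table: one dict per "server";
    columns, not rows, are the unit of distribution."""

    def __init__(self, n_servers: int = 2):
        self.shards = [{} for _ in range(n_servers)]

    def shard_for(self, col: str) -> dict:
        # Hash the column name to pick its home server.
        h = int(hashlib.sha256(col.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def insert_column(self, col: str, values: Sequence[float]) -> None:
        # Writing a column touches exactly one shard; no row rewrite,
        # no transaction coordination across servers.
        self.shard_for(col)[col] = list(values)

    def select(self, cols: Sequence[str], limit: int) -> list:
        # A SELECT reads only the requested columns, so cost scales
        # with len(cols), not with the total width of the table.
        arrays = [self.shard_for(c)[c][:limit] for c in cols]
        return list(zip(*arrays))

store = ColumnStore()
store.insert_column("f0", range(5))
store.insert_column("f1", range(10, 15))
rows = store.select(["f0", "f1"], limit=3)
print(rows)  # [(0, 10), (1, 11), (2, 12)]
```

Dropping joins and transactions is what makes the layout simple: a column insert is a single-shard write, and a select is a fan-out read with no cross-server coordination.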

Performance benchmarks on a two-server cluster showed that creating a 1M-column table took about six minutes, inserting a single column of 1M values took two seconds, and selecting roughly 60 columns over 5,000 rows finished in one second. The author asked Hacker News readers how they handle ultra-wide datasets without heavy ETL or complex joins.
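Those numbers are consistent with a cost model where query time depends on the columns actually read, not the table's total width. A toy check (pure Python, far smaller than the 1M-column benchmark) illustrates the property: selecting the same 60 columns from a 100-column and a 2,000-column column-major table does roughly the same work.

```python
import time

def make_table(width: int, rows: int) -> dict:
    # Column-major toy table: {column name -> list of values}.
    return {f"f{i}": list(range(rows)) for i in range(width)}

def select_cols(table: dict, cols: list, rows: int) -> list:
    # Touch only the requested columns, as a columnar engine would.
    return list(zip(*(table[c][:rows] for c in cols)))

narrow = make_table(width=100, rows=1_000)
wide = make_table(width=2_000, rows=1_000)
cols = [f"f{i}" for i in range(60)]

for name, table in (("narrow", narrow), ("wide", wide)):
    start = time.perf_counter()
    out = select_cols(table, cols, rows=1_000)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{name}: {len(out)} rows x {len(out[0])} cols in {elapsed:.1f} ms")
```

In a row-oriented store, by contrast, every row read drags the full width through the planner and the I/O path, which is where the metadata and planning bottlenecks the post mentions come from.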