HeadlinesBriefing.com

OpenAI's Data Agent Architecture: How a Vanilla LLM Powers 90,000 Tables

ByteByteGo

OpenAI built an internal data agent to solve its biggest analytics bottleneck: helping 4,000 users navigate 90,000 tables and 1.5 exabytes of data. Most teams waste hours just identifying the right tables before writing SQL. The agent answers natural-language questions in Slack, IDEs, and web portals, and returns verified answers together with the SQL and the source tables behind them.
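To make the shape of such a response concrete, here is a minimal sketch of an answer that carries its SQL and source tables alongside the result. The class name `AgentAnswer`, its fields, and the message layout are hypothetical, chosen only for illustration; the article does not describe OpenAI's actual response format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAnswer:
    """Hypothetical container for one agent response (not OpenAI's schema)."""

    question: str
    answer: str
    sql: str
    source_tables: tuple[str, ...]

    def as_message(self) -> str:
        # Show the answer first, then the evidence a reader can check:
        # the query that produced it and the tables it read from.
        sources = ", ".join(self.source_tables)
        return f"{self.answer}\n\nSQL:\n{self.sql}\n\nSources: {sources}"


reply = AgentAnswer(
    question="How many orders were placed yesterday?",
    answer="1,204 orders",
    sql="SELECT COUNT(*) FROM sales.orders WHERE order_date = CURRENT_DATE - 1",
    source_tables=("sales.orders",),
)
print(reply.as_message())
```

Keeping the query and sources in the same object as the answer means any surface (chat, editor, or web) can render the evidence without a second lookup.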

The team chose a deliberately simple architecture over complex multi-model systems. With GPT-5.5 as the single foundation model, they rely on strong data infrastructure rather than sophisticated routing. The real engineering lives in context assembly, not model complexity. This approach proved sufficient for their scale.

Six context layers feed the agent: table usage metadata from popular dashboards, human annotations from table owners, Codex enrichment that crawls pipeline code nightly, institutional knowledge from docs, memory of past corrections, and runtime context. Tables are processed in batches of 100-200 over 5-10 minutes, capturing each table's derivation logic and freshness.
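The layered context and batched processing described above can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's implementation: the layer keys paraphrase the six layers named in the text, and `TableContext`, `render`, and `batches` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterator

# Keys paraphrase the six layers listed in the text; the names and their
# ordering here are assumptions made for this sketch.
LAYERS = (
    "table_usage",
    "human_annotations",
    "codex_enrichment",
    "institutional_knowledge",
    "memory",
    "runtime_context",
)


@dataclass
class TableContext:
    """Hypothetical per-table context record, one text snippet per layer."""

    table: str
    layers: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        # Emit layers in a fixed order and skip any that are empty or missing.
        lines = [f"# {self.table}"]
        for name in LAYERS:
            text = self.layers.get(name)
            if text:
                lines.append(f"[{name}] {text}")
        return "\n".join(lines)


def batches(tables: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` table names."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tables), size):
        yield tables[start:start + size]


ctx = TableContext(
    table="sales.orders",
    layers={
        "memory": "Exclude test accounts when counting orders.",
        "table_usage": "Queried by the weekly revenue dashboard.",
    },
)
print(ctx.render())
print([len(b) for b in batches([f"t{i}" for i in range(250)], size=100)])
```

A single model can then receive the rendered text for the relevant tables as plain prompt context, which is consistent with the article's point that the effort goes into assembling context rather than into routing between models.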

The same Codex investment also enabled the migration of 600 petabytes and thousands of DAGs between clouds in just two months. Emma Tang's team demonstrated that reliable agents need less architectural complexity when they are built on solid data foundations. Simple tools, well executed, outperform elaborate frameworks.