HeadlinesBriefing.com

OpenAI Uncovers 18-Year-Old libunwind Bug Behind ChatGPT Crashes

OpenAI Blog

OpenAI's ChatGPT data infrastructure relies on Rockset, a cloud-native search system written in C++ for performance. When mysterious crashes began showing corrupted return addresses and misaligned stack pointers, the team faced an unusual debugging challenge. These weren't typical segfault patterns: the crashes occurred after functions returned to invalid addresses, suggesting something deeper than application code.

The investigation started conventionally, with engineers examining individual core dumps to trace the corruption back through stack frames. However, Rockset's heavily inlined updateDocument method created an overwhelming search space. Application logs proved unreliable because the stack traces themselves were corrupted, which made it impossible to classify the crashes. The team initially assumed a software issue and dismissed hardware faults because the crashes appeared across multiple regions.

After deeper analysis showed small functions leaving the stack misaligned with no obvious code path to explain it, the team shifted focus. They discovered two separate root causes: silent hardware corruption on an Azure host whose CPU produced incorrect calculations, and an 18-year-old race condition in GNU libunwind that had gone unnoticed in this widely used library.

The breakthrough came from treating crashes like epidemiological data rather than isolated incidents. By building comprehensive datasets of failure patterns, OpenAI identified correlations that individual core dump analysis missed. This approach highlights how the complexity of modern AI infrastructure demands systematic debugging methods beyond traditional techniques.
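
To make the dataset-driven idea concrete, here is a minimal Python sketch of one way crash records could be aggregated to surface a suspicious host. It is only an illustration: the records, the field names (`host`, `signature`), the fleet size, and the threshold factor are all hypothetical and are not taken from OpenAI's write-up or tooling.

```python
from collections import Counter

# Hypothetical crash records, invented purely for illustration.
# One host ("host-a") accounts for most crashes; four others have one each.
crashes = (
    [{"host": "host-a", "signature": "bad_return_address"} for _ in range(6)]
    + [{"host": h, "signature": "misaligned_stack"}
       for h in ("host-b", "host-c", "host-d", "host-e")]
)


def crash_share_by(records, key):
    """Return each value's share of all crashes for the given field."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def outliers(records, key, fleet_size, factor=3.0):
    """Flag values whose crash share exceeds `factor` times the share
    expected if crashes were spread evenly over `fleet_size` members."""
    expected = 1.0 / fleet_size
    shares = crash_share_by(records, key)
    return sorted(v for v, s in shares.items() if s > factor * expected)


# Assumed fleet of 10 hosts: an even spread would give each host 10% of crashes.
suspect_hosts = outliers(crashes, "host", fleet_size=10)
print(suspect_hosts)
```

In this toy data a single host carries a crash share well above the even-spread expectation, which is the kind of signal that looking at one core dump at a time would not reveal. A real analysis would likely also break crashes down by signature, region, and binary version before drawing conclusions.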