HeadlinesBriefing.com

DoorDash's LLM Testing Framework

ByteByteGo

DoorDash developed a testing system to address hallucination problems in its customer support chatbot. When the food delivery platform replaced its predictable, hand-built decision trees with LLMs, it had to contend with the models' non-deterministic behavior. Its solution is a "simulation and evaluation flywheel" that enables rapid iteration without risking real customer experiences.

The flywheel combines two parts: an offline simulator that uses LLMs to generate realistic multi-turn conversations, and an evaluation framework that automatically grades the chatbot's performance. The simulator draws from historical transcripts and produces dynamic responses based on detailed behavioral profiles rather than scripted messages. This approach lets DoorDash test complex edge cases that its previous infrastructure couldn't handle.
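The simulate-then-grade loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DoorDash's implementation: the `Profile` fields, the `simulate` and `evaluate` functions, and the two checks are all hypothetical names, and the customer and chatbot are plain stub functions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class Profile:
    """Hypothetical behavioral profile for a simulated customer."""
    goal: str
    mood: str
    max_turns: int = 3


def simulate(
    profile: Profile,
    customer_fn: Callable[[Profile, List[Turn]], str],
    bot_fn: Callable[[List[Turn]], str],
) -> List[Turn]:
    """Alternate customer and bot messages to build a multi-turn transcript."""
    transcript: List[Turn] = []
    for _ in range(profile.max_turns):
        transcript.append(("customer", customer_fn(profile, transcript)))
        transcript.append(("bot", bot_fn(transcript)))
    return transcript


def evaluate(
    transcript: List[Turn],
    checks: Dict[str, Callable[[List[Turn]], bool]],
) -> Dict[str, bool]:
    """Run every named check against the transcript and collect pass/fail."""
    return {name: check(transcript) for name, check in checks.items()}


# Stubs standing in for LLM calls (assumption: a real system would prompt a model here).
def stub_customer(profile: Profile, transcript: List[Turn]) -> str:
    turn_number = len(transcript) // 2 + 1
    return f"[{profile.mood}] I need help: {profile.goal} (turn {turn_number})"


def stub_bot(transcript: List[Turn]) -> str:
    return "I'm sorry about that. I can issue a refund for the missing item."


# Two toy checks; a real suite would include hallucination and quality graders.
checks = {
    "bot_replied_every_turn": lambda t: all(
        t[i][0] == "customer" and t[i + 1][0] == "bot" for i in range(0, len(t), 2)
    ),
    "no_empty_replies": lambda t: all(text.strip() for _, text in t),
}

profile = Profile(goal="missing item in order", mood="frustrated", max_turns=2)
transcript = simulate(profile, stub_customer, stub_bot)
scores = evaluate(transcript, checks)
```

Because the customer side is driven by a profile rather than a fixed script, swapping in a different `Profile` exercises a different conversation path without touching the grading code.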

The suite includes more than 50 evaluations covering hallucination detection and quality metrics, and the system runs more than 200 simulated conversations in under five minutes. What previously took days of manual testing now takes hours. The evaluation suite serves as both a quality check and a regression test, verifying that changes work before deployment.
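One way an evaluation suite can double as a regression test is to compare per-evaluation pass rates against a stored baseline. The sketch below assumes that design. The function names, the evaluation names, the tolerance, and the sample numbers are all hypothetical and are not taken from DoorDash's system.

```python
from typing import Dict, List


def pass_rates(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of simulated conversations that passed each evaluation."""
    if not results:
        raise ValueError("need at least one conversation result")
    return {
        name: sum(r[name] for r in results) / len(results)
        for name in results[0]
    }


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Names of evaluations whose pass rate fell more than `tolerance` below baseline."""
    return [
        name
        for name, base_rate in baseline.items()
        if candidate.get(name, 0.0) < base_rate - tolerance
    ]


# Illustrative data only: pass/fail per evaluation for four simulated conversations.
candidate_results = [
    {"no_hallucination": True, "resolved": True},
    {"no_hallucination": True, "resolved": False},
    {"no_hallucination": False, "resolved": True},
    {"no_hallucination": True, "resolved": True},
]
baseline = {"no_hallucination": 0.95, "resolved": 0.75}

candidate = pass_rates(candidate_results)
failed = regressions(baseline, candidate)
```

A deployment gate could then block a change whenever `failed` is non-empty, which is what lets the same suite act as a quality check on new behavior and a guard against breaking existing behavior.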