HeadlinesBriefing.com

AI chatbots fail 80%+ in medical diagnoses with incomplete data

Source: Financial Times

Failure rates above 80% in early diagnoses highlight AI chatbots' limitations, new research shows. A study in *JAMA Network Open* tested leading large language models (LLMs) on clinical vignettes and found they struggled when patient data was sparse. While models such as Claude and Gemini could accurately diagnose cases given complete information, their early-stage reasoning faltered consistently. This gap raises alarms about deploying AI as a standalone diagnostic tool, particularly in settings where users might input vague symptoms or fragmented health records.

The research evaluated 21 LLMs from developers including OpenAI, Anthropic, Google, xAI, and DeepSeek. All models exceeded 80% failure rates in differential diagnosis scenarios, where incomplete data forced premature conclusions. Error rates fell below 40% for final diagnoses made with full data, but this still lags behind human clinicians. Lead author Arya Rao noted, "These models excel at finalizing diagnoses but lack the exploratory reasoning doctors use early in care." The study underscores the risks for patients and providers who rely on AI without clinical oversight. Companies like Google and Anthropic have built safeguards—Gemini prompts users to verify information, while Claude redirects queries to professionals—but these may not prevent initial errors.

Despite these limitations, AI could still aid underserved areas with scarce medical access. Sanjay Kinra of the London School of Hygiene & Tropical Medicine suggested that specialized models like Google’s Articulate Medical Intelligence Explorer (AMIE) might bridge gaps in remote regions. However, he stressed that AI cannot replicate the nuanced judgment of physical exams or direct patient interaction. For investors, this signals a need for caution in betting on general-purpose chatbots for clinical use. The technology may evolve, but current systems require human validation, and regulatory scrutiny is likely to follow as real-world risks emerge.