HeadlinesBriefing.com

Why 53 AI Models Fail a Simple Car Wash Logic Test

Hacker News

Researchers tested 53 models with a one-step logic puzzle: should you walk or drive 50 meters to a car wash? On a single try, 42 models incorrectly chose "walk," missing that the car itself has to be driven to the wash. Only 11 got it right, including GPT-5 and Claude Opus 4.6. The wrong answers fixated on how short the distance was and ignored that constraint.

When each model was run 10 times, the picture got worse. Just five models (Claude Opus 4.6, three Gemini variants, and Grok-4) answered correctly every time. GPT-5 succeeded 70% of the time. In a human baseline of 10,000 people, 71.5% chose "drive," a rate that beat 48 of the 53 AI models. Many models were inconsistent, flipping answers even though the prompt was identical.
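The repeated-run comparison can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual harness: the function name `summarize_runs`, the model names, and the sample answer lists are hypothetical. Only the 71.5% human baseline and the 7-of-10 split (GPT-5's reported rate) come from the figures above.

```python
HUMAN_BASELINE = 0.715  # share of the 10,000 surveyed people who chose "drive"


def summarize_runs(answers, correct="drive"):
    """Return (pass_rate, label) for one model's repeated answers."""
    if not answers:
        raise ValueError("need at least one answer")
    pass_rate = sum(a == correct for a in answers) / len(answers)
    if pass_rate == 1.0:
        label = "always correct"
    elif pass_rate == 0.0:
        label = "always wrong"
    else:
        label = "inconsistent"
    return pass_rate, label


# Hypothetical answer logs; only the 7-of-10 split mirrors a reported rate.
runs = {
    "model_a": ["drive"] * 10,
    "model_b": ["drive"] * 7 + ["walk"] * 3,
    "model_c": ["walk"] * 10,
}

for name, answers in runs.items():
    rate, label = summarize_runs(answers)
    beats_humans = rate > HUMAN_BASELINE
    print(f"{name}: {rate:.0%} correct, {label}, beats human baseline: {beats_humans}")
```

In this sketch, a model that is right 7 times out of 10 is labeled inconsistent and falls just short of the human baseline.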

This exposes a critical reliability gap. The test requires overriding a learned "short distance equals walk" heuristic with contextual reasoning. Models that succeed only some of the time present the greatest production risk, because they can pass an evaluation and then fail unpredictably. Although this is a toy problem, it shows that basic logical consistency remains elusive for most leading AI models, even on tasks with a single, obvious inference step.
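A back-of-the-envelope calculation shows why an intermittently correct model can slip through evaluation. It assumes that each run is an independent trial with a fixed success probability, which real models may not satisfy. The function name `pass_probability` is illustrative.

```python
def pass_probability(success_rate, trials):
    """Chance of answering correctly on every one of `trials` runs,
    assuming independent runs with a fixed per-run success rate."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if trials < 1:
        raise ValueError("trials must be at least 1")
    return success_rate ** trials


# A model that is right 70% of the time (the rate reported for GPT-5).
single = pass_probability(0.7, 1)
ten = pass_probability(0.7, 10)
print(f"passes a single-try check: {single:.1%}")
print(f"passes all 10 runs: {ten:.1%}")
```

Under these assumptions, a single-try check passes such a model 70% of the time, while requiring 10 correct answers in a row would pass it less than 3% of the time.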