HeadlinesBriefing.com

AI models disagree on 67% of fact-checks

Hacker News

Frontier AI models disagree significantly when evaluating real-world fact-checks. A study of five leading LLMs found that they disagreed on 67% of 1,000 real user claims submitted to a fact-checking platform. These are not benchmark questions with known answers but real verification requests. The analysis shows that the models struggle to reach consensus on factual accuracy, raising questions about the reliability of automated fact-checking systems.

The study also measured how severe the disagreement was: on 34% of claims, model verdicts were separated by a substantive gap of two or more buckets on the verdict scale. Krippendorff's α (ordinal) across the models was only 0.639. Some models concentrate their verdicts at the True/False poles, while others spread them across the middle buckets. Gemini 3 Pro models showed the highest peer agreement at 75%, while Claude Opus 4.7 paired with Gemini models showed the lowest at 53%.
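To make these figures concrete, here is a minimal Python sketch of how such rates could be computed from a table of model verdicts. It is not the study's code. The five-bucket scale, the model names, the toy data, and the exact definitions (a claim counts as "disagreement" when any two models differ, and "peer agreement" means identical verdicts) are all assumptions made for illustration; the study may define them differently.

```python
from itertools import combinations

# Hypothetical five-bucket ordinal scale (assumption: the study's actual
# bucket labels are not given in this summary).
BUCKETS = ["False", "Mostly False", "Mixed", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(BUCKETS)}


def disagreement_stats(verdicts):
    """Return (share of claims with any disagreement,
    share of claims whose verdicts span 2+ buckets).

    verdicts: dict mapping model name -> list of bucket labels, one per claim.
    """
    models = list(verdicts)
    n_claims = len(verdicts[models[0]])
    any_disagreement = 0
    wide_gap = 0
    for i in range(n_claims):
        ranks = [RANK[verdicts[m][i]] for m in models]
        spread = max(ranks) - min(ranks)  # distance between extreme verdicts
        any_disagreement += spread > 0
        wide_gap += spread >= 2
    return any_disagreement / n_claims, wide_gap / n_claims


def pairwise_agreement(verdicts):
    """Share of claims on which each pair of models gives the identical verdict."""
    result = {}
    for a, b in combinations(verdicts, 2):
        same = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
        result[(a, b)] = same / len(verdicts[a])
    return result


# Toy data: three made-up models judging four made-up claims.
toy = {
    "model_a": ["True", "False", "Mixed", "True"],
    "model_b": ["True", "Mostly False", "Mostly True", "True"],
    "model_c": ["True", "False", "False", "Mostly True"],
}
disagree_rate, wide_gap_rate = disagreement_stats(toy)
pairs = pairwise_agreement(toy)
```

On the toy data, three of the four claims show some disagreement, but only one has verdicts two or more buckets apart. That is the same distinction the study draws between its 67% and 34% figures.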

This disagreement highlights challenges in developing reliable AI fact-checking systems. When models do converge, they strongly favor definitive True/False verdicts over nuanced ones. The findings suggest current LLMs lack consistent judgment standards for real-world fact verification. Users should be cautious about treating AI fact-checks as definitive, especially on complex claims that require nuanced assessment.