HeadlinesBriefing.com

Study: LLMs Corrupt About 25% of Document Content in Delegated Workflows

Hacker News

A new research benchmark called DELEGATE-52 finds that large language models degrade the documents they are asked to work on during delegated tasks. The study tested 19 LLMs across 52 professional domains, including coding, crystallography, and music notation. Researchers found that even the most advanced models systematically degrade documents during long workflows, introducing errors that compound over time.

Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 were among the frontier models evaluated. The results showed that these top-tier systems had corrupted approximately 25% of document content by the end of extended delegated tasks. Degradation worsened with larger documents, longer interactions, and the presence of distractor files. Agentic tool use did not improve outcomes, which suggests the core problem lies in how these models handle iterative document modifications.

The errors weren't random noise. They were sparse but severe, silently corrupting documents in ways that might escape casual review. This poses a significant challenge for the growing "vibe coding" trend, where developers hand off increasingly complex tasks to AI. Trust in delegated workflows depends on faithful execution, and the study's results suggest current models do not yet meet that expectation.
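Because sparse, silent edits are easy to miss on a read-through, one practical safeguard is to diff a document before and after each delegated step and flag any change outside the region the task was supposed to touch. The sketch below is illustrative only: it is not the DELEGATE-52 metric, and the function names `changed_fraction` and `unexpected_changes`, as well as the line-level comparison, are assumptions made for this example. It uses Python's standard `difflib` module.

```python
import difflib


def _matcher(before: str, after: str) -> difflib.SequenceMatcher:
    # Compare line by line; autojunk=False avoids heuristics that can
    # skip frequently repeated lines in longer documents.
    return difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )


def changed_fraction(before: str, after: str) -> float:
    """Fraction of the original lines that did not survive unchanged.

    Hypothetical helper: a rough proxy for corruption, not the
    benchmark's own scoring.
    """
    total = len(before.splitlines())
    if total == 0:
        return 0.0
    kept = sum(block.size for block in _matcher(before, after).get_matching_blocks())
    return 1.0 - kept / total


def unexpected_changes(before: str, after: str, allowed_lines: set[int]) -> list[int]:
    """0-based indices of original lines that were replaced or deleted
    although the task was not supposed to touch them.

    Pure insertions do not alter an original line and are not reported here.
    """
    touched: set[int] = set()
    for tag, i1, i2, _j1, _j2 in _matcher(before, after).get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(range(i1, i2))
    return sorted(touched - allowed_lines)


if __name__ == "__main__":
    original = "title\nalpha = 1\nbeta = 2\ngamma = 3\nend"
    # The task was only meant to edit line 1, but line 3 changed as well.
    edited = "title\nalpha = 10\nbeta = 2\ngamma = 30\nend"
    print(changed_fraction(original, edited))
    print(unexpected_changes(original, edited, allowed_lines={1}))
```

A check like this only catches edits that land outside the expected region. It cannot tell whether an edit inside that region is correct, so it complements review rather than replacing it.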