HeadlinesBriefing.com

DeepSWE Benchmarks Challenge Contaminated Coding Agent Tests

Hacker News

Researchers released DeepSWE, a contamination-free benchmark for evaluating long-horizon coding agents. It comprises 113 original tasks across 91 repositories in five languages, and none of the tasks is drawn from existing GitHub commits or PRs. Its prompts are about half the length of SWE-bench Pro's, yet solutions require 5.5x more code, so the tasks test genuine problem-solving rather than memorized patches.

The team audited SWE-bench Pro and found that its verifier misgrades outputs, with roughly 8% false positives (incorrect solutions marked as passing) and 24% false negatives (correct solutions marked as failing). A cross-check with AI judges showed that 32% of SWE-bench Pro verdicts conflicted with human analysis, versus just 1.4% for DeepSWE. DeepSWE's verifiers score observable behavior instead of matching a reference implementation.
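The audit figures above come down to simple counting, which a short sketch can make concrete. The Python below is only an illustration: the function name `audit_verifier`, the input shape, and the sample data are hypothetical, and it assumes each rate is expressed as a share of all audited verdicts (an assumption, though one consistent with 8% and 24% summing to the reported 32%). The source does not describe the team's actual procedure.

```python
def audit_verifier(verdicts):
    """Compare automated verdicts against human judgments.

    verdicts: iterable of (verifier_passed, human_says_correct) booleans.
    Rates are shares of ALL audited verdicts (an assumption of this sketch).
    """
    total = 0
    false_positives = 0  # verifier passed a solution humans judged wrong
    false_negatives = 0  # verifier failed a solution humans judged correct
    for verifier_passed, human_says_correct in verdicts:
        total += 1
        if verifier_passed and not human_says_correct:
            false_positives += 1
        elif not verifier_passed and human_says_correct:
            false_negatives += 1
    if total == 0:
        raise ValueError("no verdicts to audit")
    return {
        "false_positive": false_positives / total,
        "false_negative": false_negatives / total,
        "disagreement": (false_positives + false_negatives) / total,
    }


# Synthetic sample (not real benchmark data): 25 verdicts, of which
# 2 are false positives, 6 are false negatives, and 17 agree with humans.
sample = (
    [(True, False)] * 2
    + [(False, True)] * 6
    + [(True, True)] * 10
    + [(False, False)] * 7
)
rates = audit_verifier(sample)
```

Under this definition, the disagreement rate is always the sum of the two error rates, which is why a single conflict figure can summarize both kinds of misgrading.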

DeepSWE's stricter grading exposed wide gaps between frontier models that public benchmarks had masked. No single repo dominates the leaderboard, and tasks are never merged upstream, which keeps contamination risk minimal. The benchmark gives developers a clearer picture of which agents can actually handle real-world codebases.