HeadlinesBriefing.com

SWE-bench AI PRs Fail Human Code Review at High Rate

Hacker News

A new study reveals that roughly half of AI-generated pull requests that pass automated testing would not be merged by human maintainers. Researchers had maintainers from scikit-learn, Sphinx, and pytest review 296 AI-generated PRs alongside 47 human-written patches. The findings suggest that benchmark scores may overstate real-world usefulness.

Maintainers' acceptance rates ran 24 percentage points below the pass rates reported by automated graders, and acceptance improved 9.6 percentage points per year more slowly under human review than under automated grading. The study drew maintainers from 3 of the 12 SWE-bench Verified repositories, covering 95 issues. The researchers also note that the AI agents, unlike human developers, got no opportunity to iterate on reviewer feedback.

Anthropic models such as Claude 3.5 Sonnet and Claude 4 Opus were the primary focus, since they have dominated SWE-bench Verified rankings. The study emphasizes that these results do not indicate fundamental AI limitations; rather, they highlight the gap between automated test-based grading and real-world code review standards. The researchers caution against interpreting benchmark scores as direct measures of practical utility.