HeadlinesBriefing.com

OpenAI Abandons SWE-bench Verified: Benchmark Contamination Undermines AI Coding Tests

OpenAI Blog

OpenAI has stopped using SWE-bench Verified to measure AI coding capabilities, citing widespread contamination that undermines its reliability. The benchmark, once considered a gold standard for evaluating autonomous software engineering, has become increasingly polluted by training data overlap and flawed test cases.

SWE-bench Verified was released in August 2024, and progress on it has recently stalled: top scores rose only from 74.9% to 80.9%, a gain of 6.0 percentage points, over the past six months. OpenAI's analysis identified two critical failures: 59.4% of audited problems contain flawed test cases that reject correct solutions, and every frontier model tested had seen some of the problems during training. As a result, score improvements may now reflect memorization rather than genuine capability gains.

The company found that models exposed to benchmark problems during training are more likely to succeed because they carry the additional information needed to pass underspecified tests. OpenAI recommends that developers abandon SWE-bench Verified and instead use SWE-bench Pro, which remains uncontaminated. The company is developing new evaluations to better track coding capabilities and urges the research community to focus on this critical measurement challenge.