HeadlinesBriefing.com

AI Models Struggle to Patch Real CVEs in New Benchmark

Hacker News

I built a new benchmark called CVE-Bench to see whether frontier LLMs can actually repair real security flaws. The suite runs five models—three from OpenAI and two from Poolside—against 20 recent Python CVEs under three prompt conditions: full advisory, behavior‑only description, and file‑location only. No model consistently succeeds. The evaluation runs each model in an isolated Docker container, measuring both correctness and compute cost.

Testing revealed a modest ceiling: the top performer, gpt-5.5, fixed 50% of the bugs overall and 60% when supplied the full advisory. After five faulty security tests were corrected, solve rates rose 3–7 percentage points but the ranking stayed the same. All cross‑family pairwise comparisons were statistically significant (p ≤ 0.04) under McNemar's test with continuity correction. The benchmark also records token usage. Even the most detailed prompt does not guarantee a fix.
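For readers unfamiliar with the significance test named above, here is a minimal Python sketch of McNemar's test with continuity correction applied to paired solve/fail outcomes. The function name `mcnemar_cc` and the example outcome lists are hypothetical and made up for illustration; they are not the benchmark's code or data, and the benchmark's own implementation may differ.

```python
import math


def mcnemar_cc(a_solved, b_solved):
    """McNemar's test with continuity correction on paired boolean outcomes.

    a_solved[i] and b_solved[i] say whether model A and model B solved task i.
    Returns (chi-square statistic, two-sided p-value with 1 degree of freedom).
    """
    if len(a_solved) != len(b_solved):
        raise ValueError("outcome lists must cover the same tasks")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(a_solved, b_solved) if x and not y)  # A only
    c = sum(1 for x, y in zip(a_solved, b_solved) if y and not x)  # B only
    n = b + c
    if n == 0:
        return 0.0, 1.0  # the models never disagree

    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up outcomes for 60 tasks (20 CVEs x 3 prompt conditions).
model_a = [True] * 12 + [False] * 2 + [True] * 20 + [False] * 26
model_b = [False] * 12 + [True] * 2 + [True] * 20 + [False] * 26

stat, p_value = mcnemar_cc(model_a, model_b)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

The test ignores tasks where both models agree and asks only whether the disagreements lean toward one model, which suits a benchmark where every model attempts the same set of tasks.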

Failure modes clustered around wrong‑search drift, token‑budget exhaustion, and partial patches that left the test suite failing. Token consumption varied up to fourfold for comparable outcomes, making cost unpredictable. The locate‑only prompt proved the toughest, exposing every model’s weakness when no description of the flaw is given. The study shows current AI still cannot be trusted to autonomously remediate vulnerabilities.