HeadlinesBriefing.com

LLM Security Benchmark Tests Real Vulnerabilities

Hacker News

N-Day-Bench is a novel benchmark measuring whether frontier language models can identify real vulnerabilities in codebases. Created by Winfunc Research, this monthly-updated test pulls fresh security cases directly from GitHub advisories. The benchmark aims to measure genuine cybersecurity capability beyond memorized training data, addressing the growing problem of benchmark contamination as models improve.

The methodology employs a three-agent system: a Curator builds answer keys from advisories, a Finder model explores the code via a sandboxed bash shell, and a Judge scores blinded submissions. Only repositories with 10k+ stars qualify, and diversity measures prevent any single repo from dominating. Models start from sink hints and trace bugs through the actual code without ever seeing the patches.
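To make the three-agent flow concrete, here is a minimal Python sketch of how a Curator could build an answer key from an advisory and how a Judge could score a blinded submission. Every class, field, and function name here is an assumption for illustration; the article does not describe N-Day-Bench's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Curator/Finder/Judge flow described above.
# All names and fields are assumptions, not the real N-Day-Bench code.

@dataclass
class AnswerKey:
    advisory_id: str       # e.g. a GHSA identifier
    vulnerable_file: str   # ground-truth location from the advisory
    sink_hint: str         # starting point handed to the Finder

@dataclass
class Submission:
    advisory_id: str
    located_file: str
    explanation: str

def curator_build_key(advisory: dict) -> AnswerKey:
    """Curator: turn a GitHub advisory into a ground-truth answer key."""
    return AnswerKey(
        advisory_id=advisory["id"],
        vulnerable_file=advisory["file"],
        sink_hint=advisory["sink"],
    )

def judge_score(key: AnswerKey, sub: Submission) -> float:
    """Judge: score a submission against the answer key.

    The submission is blinded: the judge never learns which model
    produced it, only the advisory id and the claimed location.
    """
    if sub.advisory_id != key.advisory_id:
        return 0.0
    return 1.0 if sub.located_file == key.vulnerable_file else 0.0

# Usage: one advisory, one correct submission (placeholder data).
advisory = {"id": "GHSA-example", "file": "src/parser.c",
            "sink": "memcpy in parse_header"}
key = curator_build_key(advisory)
sub = Submission("GHSA-example", "src/parser.c",
                 "overflow reachable from parse_header")
print(judge_score(key, sub))
```

In this sketch the Finder is the piece left out: it would be the model under evaluation, working in the sandboxed shell from the `sink_hint` to produce the `Submission`.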

The benchmark currently evaluates GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, and Kimi K2.5, using a monthly refresh to stay ahead of training contamination. All traces remain publicly browsable, allowing researchers to examine model performance on real-world cases. This approach brings unusual transparency to assessing AI security capabilities.