HeadlinesBriefing.com

SIR-Bench: New Benchmark Tests Security Agents' Investigation Skills

Hacker News

Researchers have introduced SIR-Bench, a benchmark of 794 test cases designed to evaluate whether autonomous security incident response agents actually conduct forensic investigations or merely parrot alerts. Derived from 129 anonymized real-world incident patterns with expert-validated ground truth, the benchmark measures both triage decision accuracy and the discovery of novel evidence through active investigation.

The team developed the Once Upon A Threat (OUAT) framework to replay genuine incident patterns in controlled cloud environments, producing authentic telemetry with measurable outcomes. SIR-Bench evaluates agents across three complementary metrics: triage accuracy, novel finding discovery, and tool usage appropriateness. An adversarial LLM-as-Judge inverts the traditional burden of proof, requiring concrete forensic evidence before crediting an investigation as successful.
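The inverted burden of proof can be illustrated with a small default-deny check. This is only a sketch under stated assumptions, not the benchmark's actual judge (which is LLM-based): the `Finding` type, the `credit_finding` function, and the record identifiers are all hypothetical. A finding is credited only if it cites at least one record that really exists in the telemetry and was not already contained in the original alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one agent finding (not from the source)."""
    claim: str
    evidence: frozenset  # IDs of telemetry records the agent cites


def credit_finding(finding: Finding, alert_evidence: set, telemetry: set) -> bool:
    """Default-deny: return True only when concrete, novel evidence is cited.

    - Cited records that do not exist in the telemetry are ignored.
    - Records already present in the alert do not count as novel.
    """
    verified = set(finding.evidence) & telemetry   # evidence that actually exists
    novel = verified - alert_evidence              # evidence beyond the alert itself
    return len(novel) > 0


# Toy data, invented purely for illustration.
alert_evidence = {"log-1"}
telemetry = {"log-1", "log-2", "log-3"}

parroted = Finding("Suspicious login", frozenset({"log-1"}))
investigated = Finding("Credential reuse from new host", frozenset({"log-2"}))
unsupported = Finding("Data exfiltration", frozenset())
fabricated = Finding("Lateral movement", frozenset({"log-9"}))
```

In this sketch, restating the alert, citing nothing, or citing records that do not exist all earn no credit; only the finding backed by a verifiable record outside the alert is counted.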

On the benchmark, the team's own SIR agent achieved 97.1% true positive detection and 73.4% false positive rejection, and uncovered an average of 5.67 novel key findings per case. According to the researchers, this establishes the first rigorous baseline for measuring investigation depth in autonomous security agents, distinguishing tools that reach correct conclusions through genuine analysis from those that get lucky.
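To make the three reported numbers concrete, here is a minimal sketch of how such rates could be computed from per-case results. The `CaseResult` fields and the `summarize` function are assumptions made for illustration, as is averaging novel findings over all cases; the source does not describe the exact computation. The sample data is a toy set, not benchmark output.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case record (field names are assumptions)."""
    is_true_positive: bool      # ground truth: the alert reflects a real incident
    agent_says_malicious: bool  # the agent's triage verdict
    novel_findings: int         # findings credited by the judge


def _rate(hits: int, total: int) -> float:
    # Avoid division by zero when a category has no cases.
    return hits / total if total else float("nan")


def summarize(results: list) -> dict:
    tp_cases = [r for r in results if r.is_true_positive]
    fp_cases = [r for r in results if not r.is_true_positive]
    return {
        # Share of real incidents the agent flagged as malicious.
        "tp_detection": _rate(sum(r.agent_says_malicious for r in tp_cases), len(tp_cases)),
        # Share of benign alerts the agent correctly dismissed.
        "fp_rejection": _rate(sum(not r.agent_says_malicious for r in fp_cases), len(fp_cases)),
        # Mean credited findings per case (assumed: averaged over all cases).
        "novel_per_case": _rate(sum(r.novel_findings for r in results), len(results)),
    }


# Toy data for illustration only.
toy = [
    CaseResult(True, True, 3),
    CaseResult(True, True, 5),
    CaseResult(False, False, 0),
    CaseResult(False, True, 4),
]
summary = summarize(toy)
```

Reporting detection and rejection separately matters because an agent that labels everything malicious would score perfectly on the first rate and zero on the second.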