HeadlinesBriefing.com

LLMs Fail Security Tests in $1,500 Experiment

Hacker News

A security researcher built an intentionally vulnerable React Native app to test whether large language models could identify the kind of Firebase vulnerabilities common in real applications. The experiment cost $1,500 and evaluated how well multiple LLMs could find exploits in the insecure app.

GPT-5.5 performed best with a 57% success rate, while DeepSeek V4 Pro proved most cost-effective at $0.19 per run. Chinese models showed greater comfort attacking databases directly, while Western models frequently hesitated because of security guardrails; Gemini refused immediately in all tests.

The research points to significant limitations in current AI security testing capabilities. Models struggled to identify the specific Firebase vulnerability pattern despite clear instructions, suggesting that LLMs remain unreliable for comprehensive penetration testing without substantial human oversight.