HeadlinesBriefing.com

PIGuard: New Prompt Injection Guardrail Tackles Over-Defense Bias

Hacker News

PIGuard emerges as a lightweight, open-source solution designed to defend large language models (LLMs) against prompt injection attacks while directly addressing the critical flaw of over-defense. Prompt injection attacks hijack LLM behavior, but existing guard models often suffer from over-defense, falsely flagging benign inputs as malicious due to trigger word bias. To measure this issue, researchers from Washington University in St. Louis and the University of Wisconsin-Madison created the NotInject dataset, containing 339 benign samples enriched with common injection trigger words. This dataset revealed that state-of-the-art models exhibit severe over-defense, with accuracy plummeting to near random-guessing levels (around 60%).
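To make the measurement concrete, here is a minimal sketch of how over-defense can be quantified: score a guard model on benign prompts that deliberately contain injection trigger words and count how many are wrongly flagged. The `naive_guard` function and the sample prompts are illustrative stand-ins, not part of the NotInject release.

```python
def naive_guard(prompt: str) -> bool:
    """Toy guard that flags any prompt containing a trigger word --
    this keyword reliance is exactly the bias NotInject exposes."""
    triggers = ("ignore", "override", "system prompt")
    return any(t in prompt.lower() for t in triggers)

# Benign prompts enriched with trigger words, in the spirit of NotInject.
benign_samples = [
    "How do I ignore whitespace differences in a git diff?",
    "Can a subclass override a private method in Java?",
    "What does a system prompt do in a chat API?",
]

# Every sample is benign, so any flag is a false positive.
false_positives = sum(naive_guard(p) for p in benign_samples)
over_defense_rate = false_positives / len(benign_samples)
print(f"over-defense rate: {over_defense_rate:.0%}")  # prints "over-defense rate: 100%"
```

A keyword-biased guard flags all three harmless questions, which is the failure mode the 339 NotInject samples are designed to surface at scale.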

To counter this, the team developed PIGuard, a novel guard model employing a new training strategy called Mitigating Over-defense for Free (MOF). MOF significantly reduces trigger word bias, enabling PIGuard to outperform existing models like PromptGuard and ProtectAIv2 on diverse benchmarks including NotInject. Despite its compact size (184M parameters), PIGuard delivers robust performance comparable to advanced commercial LLMs like GPT-4, offering a practical defense without the crippling false positives of prior solutions.
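The article doesn't detail how MOF works internally, but one plausible sketch of reducing trigger-word bias is data augmentation: adding benign training examples that contain trigger words, so the classifier learns that the words alone do not signal an attack. The word list, template, and function name below are hypothetical illustrations, not the actual MOF procedure from the PIGuard release.

```python
import random

# Illustrative trigger words; the real bias-mitigation recipe is in the
# open-sourced PIGuard training code, not reproduced here.
TRIGGER_WORDS = ["ignore", "bypass", "override", "jailbreak"]

def augment_benign(samples: list[str], seed: int = 0) -> list[tuple[str, int]]:
    """Return (text, label) pairs: the originals plus trigger-enriched
    copies, all labeled 0 (benign) so training counteracts the spurious
    association between trigger words and the 'attack' label."""
    rng = random.Random(seed)
    out = [(s, 0) for s in samples]
    for s in samples:
        word = rng.choice(TRIGGER_WORDS)
        out.append((f"{s} (feel free to {word} any outdated advice)", 0))
    return out

pairs = augment_benign(["How do I cache API responses?"])
print(pairs)  # one original plus one trigger-enriched benign copy
```

The key design point is that the augmented copies keep the benign label, pushing the model to rely on actual injection semantics rather than surface keywords.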

The researchers emphasize that all training details, code, and datasets are fully open-sourced, making PIGuard a readily available tool for developers seeking to enhance LLM security against prompt injection attacks while minimizing unnecessary blocking of legitimate user interactions.