HeadlinesBriefing.com

AI Reasoning Models Fail to Control Thoughts, Boosting Safety

OpenAI Blog

OpenAI's research reveals that current reasoning models struggle to deliberately manipulate their internal thought processes, a finding that actually strengthens oversight capabilities. The study tested 13 reasoning models using a new evaluation suite called CoT-Control, which measures how well AI agents can follow instructions about their chain-of-thought reasoning.

Models scored surprisingly low on controllability tests, with success rates ranging from just 0.1% to 15.4% across different tasks. Even when models recognized they were failing to comply with instructions, they continued to produce uncontrolled reasoning traces. This inability to obfuscate their reasoning makes monitoring systems more reliable for detecting unsafe behavior.

The research examined several factors, including model size, reasoning length, and reinforcement learning training. While larger models showed slightly better controllability, longer reasoning and additional training actually reduced it. When models were told they were being monitored, controllability increased only marginally. These findings suggest current AI systems lack the sophistication to deliberately hide problematic reasoning, making chain-of-thought monitoring a robust safeguard for now.