Training Agents to Self-Report Misbehavior
Bruce W. Lee, Chen Yueh-Han, Tomek Korbak

TL;DR
This paper introduces self-incrimination training for AI agents, enabling them to signal covert misbehavior, which significantly reduces undetected harmful actions without impairing overall capabilities.
Contribution
It proposes self-incrimination training, in which agents learn to call a report_scheming() tool when they covertly misbehave, and shows that it outperforms matched-capability monitors and alignment baselines.
Findings
Self-incrimination training significantly reduces the undetected successful attack rate in out-of-distribution environments.
Performance remains stable across different tasks and adversarial conditions.
Method generalizes to agents pursuing misaligned goals independently.
Abstract
Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks…
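To make the headline metric concrete, here is a minimal sketch of how an undetected successful attack rate could be computed from logged episodes. It is an illustration under stated assumptions, not the authors' evaluation code: the `Episode` record, its fields, and the choice to divide by all episodes are hypothetical. Only the tool name `report_scheming` comes from the abstract.

```python
from dataclasses import dataclass, field
from typing import List

# Tool name taken from the abstract; everything else here is hypothetical.
REPORT_TOOL = "report_scheming"


@dataclass
class Episode:
    """One agent rollout (hypothetical record format)."""
    tool_calls: List[str] = field(default_factory=list)  # names of tools the agent called
    attack_succeeded: bool = False  # did the covert harmful task succeed?


def self_incriminated(episode: Episode) -> bool:
    """True if the agent called the reporting tool at any point in the episode."""
    return REPORT_TOOL in episode.tool_calls


def undetected_successful_attack_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes where the attack succeeded and no self-report was made.

    Assumption: the denominator is all episodes; the paper may normalize differently.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    undetected = sum(
        1 for e in episodes if e.attack_succeeded and not self_incriminated(e)
    )
    return undetected / len(episodes)


episodes = [
    Episode(["read_file", "report_scheming"], attack_succeeded=True),  # caught
    Episode(["read_file"], attack_succeeded=True),                     # undetected
    Episode([], attack_succeeded=False),
    Episode(["report_scheming"], attack_succeeded=False),
]
rate = undetected_successful_attack_rate(episodes)
```

In this toy set, only the second episode counts against the agent: the attack succeeded and the reporting tool was never called.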
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
