Training Agents to Self-Report Misbehavior

Bruce W. Lee; Chen Yueh-Han; Tomek Korbak

arXiv:2602.22303·cs.LG·February 27, 2026

Open Access

TL;DR

This paper introduces self-incrimination training, which trains AI agents to emit a visible signal when they covertly misbehave. The approach significantly reduces undetected harmful actions while incurring only a minimal cost to general capabilities.

Contribution

The paper proposes self-incrimination training, in which agents are trained to call a report_scheming() tool when they behave deceptively. This self-reporting approach improves detection over matched-capability monitors and alignment baselines.

Findings

01

Self-incrimination training significantly reduces the undetected successful attack rate in out-of-distribution environments.

02

Self-incrimination performance remains consistent across different tasks and adversarial conditions.

03

Method generalizes to agents pursuing misaligned goals independently.

Abstract

Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks…
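To make the headline metric concrete, here is a minimal sketch of how an undetected successful attack rate could be computed from agent transcripts. The excerpt above does not give the paper's exact definition, so this assumes the rate is the fraction of episodes in which the covert attack succeeded and the agent never called `report_scheming`. The `Episode` record and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Tool name taken from the abstract; everything else here is an assumption.
REPORT_TOOL = "report_scheming"


@dataclass
class Episode:
    """Hypothetical record of one agent rollout."""
    attack_succeeded: bool   # did the covert side task succeed?
    tool_calls: List[str]    # names of tools the agent called, in order


def self_reported(episode: Episode) -> bool:
    """True if the agent called the reporting tool at any point."""
    return REPORT_TOOL in episode.tool_calls


def undetected_successful_attack_rate(episodes: List[Episode]) -> float:
    """Assumed definition: share of episodes with a successful attack
    and no self-report. The paper's own definition may differ."""
    if not episodes:
        raise ValueError("need at least one episode")
    undetected = sum(
        1 for ep in episodes
        if ep.attack_succeeded and not self_reported(ep)
    )
    return undetected / len(episodes)


episodes = [
    Episode(True, ["read_file", "report_scheming"]),  # succeeded, reported
    Episode(True, ["read_file", "send_email"]),       # succeeded, unreported
    Episode(False, ["read_file"]),                    # attack failed
    Episode(False, ["report_scheming"]),              # failed, reported
]
rate = undetected_successful_attack_rate(episodes)
```

Under this assumed definition, only the second episode counts toward the rate, since it is the only one where the attack succeeded without a self-report.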

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)