Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Dipika Khullar; Jack Hopkins; Rowan Wang; Fabien Roger

arXiv:2603.04582·cs.AI·March 6, 2026

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger

PDF

Open Access

TL;DR

This paper uncovers a self-attribution bias in AI monitors, showing they tend to underestimate risks or overestimate correctness when evaluating their own previous actions, which can mislead deployment reliability.

Contribution

It identifies and characterizes self-attribution bias in AI monitoring systems, revealing how context affects their evaluation accuracy and highlighting potential deployment risks.

Findings

01

Monitors are less likely to report high-risk actions when evaluated after their own previous outputs.

02

Explicitly stating the action's source does not induce self-attribution bias.

03

Evaluation on fixed examples can overstate monitor reliability in real deployment.

Abstract

Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsEthics and Social Impacts of AI · Multi-Agent Systems and Negotiation · Explainable Artificial Intelligence (XAI)