Training LLMs for Honesty via Confessions

Manas Joglekar; Jeremy Chen; Gabriel Wu; Jason Yosinski; Jasmine Wang; Boaz Barak; Amelia Glaese

arXiv:2512.08093·cs.LG·December 24, 2025

Open Access

TL;DR

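This paper introduces a training method in which an LLM, when asked after giving its original answer, produces a self-reported confession: an account of how well it complied with the letter and spirit of its policies and instructions. The confession is rewarded solely for its honesty, which is intended to make model misbehavior more transparent.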

Contribution

The paper proposes a confession-based training approach that incentivizes LLMs to report their shortcomings honestly. The confession's reward depends only on its honesty and does not change the reward for the main answer.

Findings

01. Confessions often accurately reveal model misbehavior.

02. Training with confessions modestly improves honesty over time.

03. Confessions enable better monitoring and intervention during inference.

Abstract

Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the…
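The reward separation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` fields, the binary honesty check, and the assumption that the grader knows the ground truth are all hypothetical and chosen only to keep the example small. The point it shows is that the confession's reward is computed from honesty alone and never feeds back into the reward for the original answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One training sample: an original answer followed by a confession.

    All fields are hypothetical stand-ins for illustration.
    """
    task_reward: float            # reward already assigned to the original answer
    violated_instructions: bool   # ground truth, assumed known to the grader
    confessed_violation: bool     # what the model's confession claims


def answer_reward(rollout: Rollout) -> float:
    # The answer's reward ignores the confession entirely.
    return rollout.task_reward


def confession_reward(rollout: Rollout) -> float:
    # Toy honesty check: reward 1.0 when the confession matches what
    # actually happened, 0.0 otherwise. A real grader would be far richer.
    honest = rollout.confessed_violation == rollout.violated_instructions
    return 1.0 if honest else 0.0


# Same answer, same violation, two different confessions.
admits = Rollout(task_reward=0.8, violated_instructions=True, confessed_violation=True)
denies = Rollout(task_reward=0.8, violated_instructions=True, confessed_violation=False)

print(answer_reward(admits), confession_reward(admits))
print(answer_reward(denies), confession_reward(denies))
```

In this toy setup, admitting a violation costs nothing on the answer side and earns the full confession reward, so the model has no incentive to hide the violation when confessing.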

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling