Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li; Mary Phuong; Daniel Tan

arXiv:2511.06626·cs.AI·March 24, 2026

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan

PDF

Open Access 3 Reviews

TL;DR

This paper introduces self-report fine-tuning (SRFT), a method that trains language models to admit factual mistakes and disclose hidden objectives, improving detection of misaligned AI behaviors in adversarial settings.

Contribution

The paper presents SRFT, a novel fine-tuning approach that enhances models' honesty and ability to reveal hidden objectives, especially in out-of-distribution adversarial scenarios.

Findings

01

SRFT increases models' likelihood to admit factual errors.

02

SRFT enables near-perfect detection of hidden objectives.

03

SRFT recovers 28-100% of hidden objective details.

Abstract

As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to occasionally make factual mistakes, then admit them when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 8Confidence 5

Strengths

1. The paper introduces and shows the effectiveness of a novel method to make models more honest during auditing, an important open question in AI safety. 2. The experiments support the claims and explore the topic well. 3. The proposed technique is actionable and cheap to apply to frontier models. 4. The paper is well-written, clear, and easy to follow.

Weaknesses

1. The hidden objective is provided in context, which might not be a suitable proxy for models with actual hidden objectives. 2. SRFT was not red-teamed thoroughly. To characterise the strength of SRFT, the system prompts used for the stealth task could be optimised to decrease detection rates. 3. There is a significant capability loss in the model (10-20% loss in all benchmarks shown).

Reviewer 02Rating 8Confidence 3

Strengths

S1: The core idea of SRFT is remarkably simple yet insightful — by training the model on a small dataset (770 samples) to “admit errors,” it generalizes to completely different domains (multi-turn agentic tasks with hidden objectives) and induces self-disclosure of hidden intentions. This offers a novel perspective on LLM interpretability and alignment. S2: The overall pipeline, dataset construction, evaluation metrics, and baselines (prefilled attack, baseline LLM, Claude comparison) are well-

Weaknesses

W1: The mechanism behind SRFT’s generalization to “hidden objective disclosure” lacks theoretical explanation; the paper remains largely empirical without cognitive or representational analysis. W2: The training data only contains factual error-admission samples, leading to a large domain gap. The authors should include cross-domain error-admission tasks to verify robustness. W3: Although decoy objectives are tested, the adversarial coverage is limited — stronger multi-layer deception or multi

Reviewer 03Rating 4Confidence 4

Strengths

- Well written and easy to read - Important topic with real-world applications

Weaknesses

- Beginning of section 2.1.1 reads like a related works section - Did you try investigating why is there such a big difference between the performance of different categories in Figure 2? - Did you try investigating why is there such a difference between the performance of different tasks in figure 3? - It seems that results are only on GPT 4.1; since the main contribution is the new method, it would strengthen the work by evaluating it on a broader set of models - How important is the exact for

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)