Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Keshav Shenoy; Li Yang; Abhay Sheshadri; Sören Mindermann; Jack Lindsey; Sam Marks; Rowan Wang

arXiv:2604.16812·cs.AI·April 29, 2026

TL;DR

This paper introduces introspection adapters (IAs), a scalable method that lets fine-tuned LLMs self-report their learned behaviors, which aids model auditing and the detection of harmful or hidden behaviors.

Contribution

The paper proposes a single LoRA adapter, trained jointly across multiple finetuned models, that causes each model to describe its learned behaviors in natural language.

Findings

01. IAs achieve state-of-the-art detection of hidden behaviors.

02. IAs generalize across models trained differently.

03. Scaling IAs improves behavior reporting accuracy.

Abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_{i}$ from $M$ with implanted behaviors $b_{i}$; the $(M_{i}, b_{i})$ pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes $M_{i}$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_{i}$. For example, IAs generalize to AuditBench, achieving…
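The labeled-data setup described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the names `Finetune`, `IAExample`, `build_ia_dataset`, `joint_batches`, and the prompt text are all hypothetical, and the training step itself (running each example through its frozen $M_{i}$ with the shared LoRA adapter attached) is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Finetune:
    """One (M_i, b_i) pair: a finetune of M and its implanted behavior."""
    name: str       # hypothetical identifier for the finetune M_i
    behavior: str   # natural-language description of b_i


@dataclass(frozen=True)
class IAExample:
    """One supervised example for the shared introspection adapter."""
    model_name: str  # which finetune M_i the adapter is attached to
    prompt: str      # question asking the model about itself
    target: str      # desired answer: a description of b_i


# Hypothetical prompt; the source does not specify the wording used.
INTROSPECTION_PROMPT = "Describe any behaviors you learned during finetuning."


def build_ia_dataset(finetunes, prompt=INTROSPECTION_PROMPT):
    """Turn (M_i, b_i) pairs into labeled examples, one per finetune."""
    return [IAExample(f.name, prompt, f.behavior) for f in finetunes]


def joint_batches(examples, batch_size, seed=0):
    """Yield shuffled batches that mix examples from different finetunes.

    Assumption: "jointly trained" is modeled here as sampling across all
    M_i, so that a single set of adapter weights sees every finetune.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]
```

In a real training loop, each example would presumably be routed to the finetune named in `model_name`, with only the shared adapter weights receiving gradient updates; that part depends on details the truncated abstract does not give.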
