Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability

Yu-Ting Lee; Fu-Chieh Chang; Yu-En Shu; Hui-Ying Shih; Pei-Yuan Wu

arXiv:2505.11924 · cs.CL · February 12, 2026

TL;DR

This paper investigates how large language models self-correct through prompting alone. It shows that self-correction prompts steer the model's hidden representations along interpretable latent directions, which accounts for the mechanism behind intrinsic self-correction.

Contribution

The paper introduces a mechanistic interpretability analysis of intrinsic self-correction in LLMs. It links prompt-induced representation shifts to interpretable latent directions constructed from contrastive pairs.

Findings

01. Representation shifts align with toxic and non-toxic directions.
02. Prompt-induced shifts causally relate to latent directions.
03. Self-correction mechanisms are driven by internal representation steering.

Abstract

Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its mechanism remains unclear. We show that intrinsic self-correction functions by steering hidden representations along interpretable latent directions, as evidenced by both alignment analysis and activation interventions. To achieve this, we analyze intrinsic self-correction via the representation shift induced by prompting. In parallel, we construct interpretable latent directions with contrastive pairs and verify the causal effect of these directions via activation addition. Evaluating six open-source LLMs, our results demonstrate that prompt-induced representation shifts in text detoxification and text toxification consistently align with latent…
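
The pipeline the abstract describes (a latent direction built from contrastive pairs, an alignment check against the prompt-induced representation shift, and an activation-addition intervention) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the direction is taken as a difference of class means, alignment is measured with cosine similarity, and the hidden states are synthetic vectors instead of activations from a real model layer. The function names and `run_demo` are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def latent_direction(positive_states, negative_states):
    # Assumption: direction = difference of class means over contrastive pairs.
    return subtract(mean_vector(positive_states), mean_vector(negative_states))


def representation_shift(prompted_states, baseline_states):
    # Mean hidden-state change when the self-correction prompt is added.
    return subtract(mean_vector(prompted_states), mean_vector(baseline_states))


def add_activation(hidden, direction, alpha):
    # Activation addition: move a hidden state along the direction by alpha.
    return [h + alpha * d for h, d in zip(hidden, direction)]


def run_demo(seed=0, dim=8, n=50):
    rng = random.Random(seed)
    axis = [1.0] + [0.0] * (dim - 1)  # hypothetical "non-toxic" axis

    def sample(offset):
        # Synthetic hidden state: offset along the axis plus Gaussian noise.
        return [offset * a + rng.gauss(0.0, 0.1) for a in axis]

    non_toxic = [sample(+1.0) for _ in range(n)]
    toxic = [sample(-1.0) for _ in range(n)]
    baseline = [sample(0.0) for _ in range(n)]   # no self-correction prompt
    prompted = [sample(0.5) for _ in range(n)]   # with self-correction prompt

    direction = latent_direction(non_toxic, toxic)
    shift = representation_shift(prompted, baseline)
    alignment = cosine(shift, direction)

    hidden = baseline[0]
    steered = add_activation(hidden, direction, alpha=1.0)
    # Projections onto the direction before and after the intervention.
    return alignment, dot(hidden, direction), dot(steered, direction)


if __name__ == "__main__":
    alignment, before, after = run_demo()
    print(f"alignment={alignment:.3f} before={before:.3f} after={after:.3f}")
```

In this toy setup a high cosine similarity corresponds to the alignment claim, and the increased projection after `add_activation` corresponds to the intervention step. With a real model, the vectors would be hidden states read from a chosen layer, and the causal check would compare generated text rather than projections.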

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques