Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability
Yu-Ting Lee, Fu-Chieh Chang, Yu-En Shu, Hui-Ying Shih, Pei-Yuan Wu

TL;DR
This paper shows that intrinsic self-correction, in which a large language model refines its own outputs through prompting alone, works by steering the model's hidden representations along interpretable latent directions.
Contribution
It introduces a mechanistic interpretability approach to understand intrinsic self-correction in LLMs, linking prompt-induced representation shifts to latent directions.
Findings
Prompt-induced representation shifts align with toxic and non-toxic latent directions.
Activation addition verifies the causal effect of these latent directions.
Intrinsic self-correction works by steering hidden representations along these directions.
Abstract
Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its mechanism remains unclear. We show that intrinsic self-correction functions by steering hidden representations along interpretable latent directions, as evidenced by both alignment analysis and activation interventions. To achieve this, we analyze intrinsic self-correction via the representation shift induced by prompting. In parallel, we construct interpretable latent directions with contrastive pairs and verify the causal effect of these directions via activation addition. Evaluating six open-source LLMs, our results demonstrate that prompt-induced representation shifts in text detoxification and text toxification consistently align with latent…
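The three ingredients named in the abstract (a latent direction built from contrastive pairs, a prompt-induced representation shift, and activation addition) can be illustrated with a toy sketch. This is not the authors' implementation: the difference-of-means construction, the cosine-similarity alignment score, the scaling factor alpha, all function names, and the 3-dimensional vectors are assumptions chosen for illustration. Real hidden states would come from a chosen layer of an LLM.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def latent_direction(positive_states, negative_states):
    """Assumed construction: difference of the mean hidden states
    of the two sides of the contrastive pairs."""
    return subtract(mean_vector(positive_states), mean_vector(negative_states))


def representation_shift(prompted_state, baseline_state):
    """Change in hidden state caused by adding the self-correction prompt."""
    return subtract(prompted_state, baseline_state)


def activation_addition(hidden_state, direction, alpha):
    """Intervene by adding a scaled direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Made-up 3-d "hidden states" for contrastive non-toxic / toxic texts.
non_toxic_states = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
toxic_states = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
direction = latent_direction(non_toxic_states, toxic_states)

# Made-up hidden states for the same input without and with a detoxification prompt.
baseline = [0.0, 2.0, 1.0]
prompted = [1.0, 1.0, 1.0]
shift = representation_shift(prompted, baseline)

# Alignment analysis: does the prompt move the state along the direction?
alignment = cosine_similarity(shift, direction)

# Activation intervention: move the baseline along the direction without any prompt.
steered = activation_addition(baseline, direction, alpha=0.5)

print("direction:", direction)
print("alignment:", round(alignment, 6))
print("steered:", steered)
```

In this toy setup the shift is parallel to the direction, so the alignment score is close to 1, and adding the direction with alpha=0.5 reproduces the prompted state. With real models neither outcome is guaranteed, and one would compare outputs generated from the steered state rather than the vectors themselves.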
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
