Self-correction is Not An Innate Capability in Language Models
Guangliang Liu, Zimo Qi, Xitong Zhang, Lu Cheng, Kristen Marie Johnson

TL;DR
This paper investigates whether moral self-correction is an innate ability of large language models, finding that LLMs lack moral sensitivity and cannot effectively use external feedback for self-correction.
Contribution
It provides a comprehensive analysis combining behavioral and mechanistic studies to show that moral self-correction is not an innate capability of LLMs.
Findings
LLMs are not morally sensitive.
External feedback does not significantly improve self-correction.
Moral self-correction is not an inherent LLM capability.
Abstract
Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection
