Self-correction is Not An Innate Capability in Language Models

Guangliang Liu; Zimo Qi; Xitong Zhang; Lu Cheng; Kristen Marie Johnson

arXiv:2410.20513 · cs.CL · January 23, 2026

Open Access

TL;DR

This paper investigates whether moral self-correction is an innate ability of large language models, finding that LLMs lack moral sensitivity and cannot effectively use external feedback for self-correction.

Contribution

It provides a comprehensive analysis combining behavioral and mechanistic studies to show that moral self-correction is not an innate capability of LLMs.

Findings

01. LLMs are not morally sensitive.

02. External feedback does not significantly improve self-correction.

03. Moral self-correction is not an inherent LLM capability.

Abstract

Although there has been growing interest in the self-correction capability of Large Language Models (LLMs), conclusions about its effectiveness vary. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical…

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection