Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
Guangliang Liu, Haitao Mao, Jiliang Tang, Kristen Marie Johnson

TL;DR
This paper investigates how moral self-correction instructions influence large language models, revealing that such corrections often act as shortcuts rather than truly altering the models' internal moral representations.
Contribution
The paper provides a comprehensive analysis of the internal mechanisms of LLMs during moral self-correction and introduces the hypothesis that intrinsic self-correction is superficial.
Findings
Self-correction improves performance when the correct answer is top-ranked.
Morality levels in hidden states predict instruction effectiveness.
Intrinsic self-correction may be superficial: it may not reduce the immorality encoded in the model's hidden states.
Abstract
Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, the process of how injecting self-correction instructions can modify the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What are the internal mechanisms of LLMs, e.g., hidden states, that are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial in terms of reduced immorality in hidden states? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly…
Taxonomy
Topics: Psychology of Moral and Emotional Judgment · Ethics in Business and Education
