Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Guangliang Liu; Haitao Mao; Jiliang Tang; Kristen Marie Johnson

arXiv:2407.15286·cs.CL·October 10, 2024

TL;DR

This paper investigates how moral self-correction instructions influence large language models and argues that such instructions often act as a shortcut to more morally correct output rather than truly altering the models' internal moral representations.

Contribution

It provides a comprehensive analysis of the internal mechanisms of LLMs during moral self-correction and introduces the hypothesis that intrinsic self-correction is superficial.

Findings

01 Self-correction improves performance when the correct answer is top-ranked.

02 Morality levels in hidden states predict instruction effectiveness.

03 Intrinsic self-correction may be superficial, not reducing internal immorality.

Abstract

Large Language Models (LLMs) are capable of producing content that perpetuates stereotypes, discrimination, and toxicity. The recently proposed moral self-correction is a computationally efficient method for reducing harmful content in the responses of LLMs. However, the process of how injecting self-correction instructions can modify the behavior of LLMs remains under-explored. In this paper, we explore the effectiveness of moral self-correction by answering three research questions: (1) In what scenarios does moral self-correction work? (2) What are the internal mechanisms of LLMs, e.g., hidden states, that are influenced by moral self-correction instructions? (3) Is intrinsic moral self-correction actually superficial in terms of reduced immorality in hidden states? We argue that self-correction can help LLMs find a shortcut to more morally correct output, rather than truly…

Taxonomy

Topics: Psychology of Moral and Emotional Judgment · Ethics in Business and Education