Discourse Heuristics For Paradoxically Moral Self-Correction

Guangliang Liu; Zimo Qi; Xitong Zhang; Kristen Marie Johnson

arXiv:2507.00985 · cs.CL · November 4, 2025

TL;DR

This paper investigates the discourse heuristics underlying moral self-correction in LLMs. It shows that reliance on heuristic shortcuts gives rise to paradoxes, and it proposes using dataset heuristics to improve moral self-correction.

Contribution

It uncovers the discourse heuristics in moral self-correction and proposes leveraging curated dataset heuristics to address the resulting paradoxes and improve the moral alignment of LLMs.

Findings

1. Heuristic shortcuts underpin effective moral self-correction.
2. The presence of these heuristics causes inconsistency when self-correction and self-diagnosis are performed jointly.
3. Generalization remains challenging, depending on context and model scale.

Abstract

Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence supporting the effectiveness of self-correction, this LLM capability operates only at a superficial level. Second, while LLMs can self-diagnose the immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the heuristics that underlie effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect…

Videos

Discourse Heuristics For Paradoxically Moral Self-Correction · Underline

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)