Discourse Heuristics For Paradoxically Moral Self-Correction
Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson

TL;DR
This paper investigates the discourse heuristics underlying moral self-correction in LLMs, revealing that reliance on heuristic shortcuts causes paradoxes and proposing dataset heuristics to improve moral self-correction.
Contribution
It uncovers the discourse heuristics in moral self-correction and proposes leveraging curated dataset heuristics to address paradoxes and improve LLM moral alignment.
Findings
Heuristic shortcuts underpin effective moral self-correction.
Presence of heuristics causes inconsistency in joint self-correction and self-diagnosis.
Generalization remains challenging because of context and model scale.
Abstract
Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI)
