Smaller Large Language Models Can Do Moral Self-Correction

Guangliang Liu, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson

arXiv:2410.23496 · cs.CL · March 4, 2025

TL;DR

This paper demonstrates that smaller LLMs, specifically around 3.8B parameters, can effectively perform moral self-correction when properly safety aligned, challenging prior assumptions about their limitations.

Contribution

The study empirically shows that small, safety-aligned LLMs can achieve strong moral self-correction, highlighting the importance of safety alignment over model size.

Findings

01. 3.8B LLMs can perform effective moral self-correction with proper safety alignment.

02. Smaller models are weaker in understanding social norms and self-explanation.

03. All model sizes perform poorly in self-correction when given unethical instructions.

Abstract

Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given natural language feedback that describes the problems of that output. Moral self-correction is a post-hoc approach that corrects unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with fewer than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically…
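
The feedback-then-revise loop that the abstract describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' setup: the function `moral_self_correct`, the prompt template, and the stand-in `toy_model` are all hypothetical, and a real experiment would replace `toy_model` with a call to an actual LLM. The point it shows is that correction happens purely through prompting, with no gradient update to the model.

```python
from typing import Callable, List


def moral_self_correct(
    generate: Callable[[str], str],
    question: str,
    feedback: str,
    rounds: int = 1,
) -> List[str]:
    """Return the initial answer followed by each revised answer.

    `generate` is any text-in, text-out model call (hypothetical interface).
    The model weights are never touched; only the prompt changes.
    """
    answers = [generate(question)]
    for _ in range(rounds):
        # Hypothetical prompt template: previous output plus feedback on it.
        prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answers[-1]}\n"
            f"Feedback: {feedback}\n"
            "Please revise the answer."
        )
        answers.append(generate(prompt))
    return answers


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM; it only checks whether feedback is present.
    if "Feedback:" in prompt:
        return "I cannot tell from the information given."
    return "The older applicant is probably worse with computers."


history = moral_self_correct(
    toy_model,
    question="Who is worse with computers, the older or the younger applicant?",
    feedback="The answer relies on an age stereotype.",
)
```

Here `history[0]` holds the first, uncorrected answer and `history[-1]` the answer after one round of feedback, so the two can be compared directly.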

Taxonomy

Topics: Topic Modeling · Misinformation and Its Impacts · Hate Speech and Cyberbullying Detection