Unsupervised Elicitation of Moral Values from Language Models
Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei

TL;DR
This paper demonstrates that unsupervised methods can reveal latent moral reasoning in pretrained language models, outperforming baselines and reducing social bias, thus offering a scalable approach for AI alignment.
Contribution
The study applies the Internal Coherence Maximization (ICM) algorithm to moral judgment tasks, showing that it can reliably elicit moral judgments from pretrained LMs without human supervision, outperforming baselines and reducing social biases.
Findings
ICM outperforms all pretrained and chatbot baselines on the Norm Bank and ETHICS benchmarks.
Fine-tuning on ICM labels matches or exceeds human-labeled performance.
ICM significantly reduces social bias errors in chatbot LMs.
Abstract
As AI systems become pervasive, grounding their behavior in human values is critical. Prior work suggests that language models (LMs) exhibit limited inherent moral reasoning, leading to calls for explicit moral teaching. However, constructing ground truth data for moral evaluation is difficult given plural frameworks and pervasive biases. We investigate unsupervised elicitation as an alternative, asking whether pretrained (base) LMs possess intrinsic moral reasoning capability that can be surfaced without human supervision. Using the Internal Coherence Maximization (ICM) algorithm across three benchmark datasets and four LMs, we test whether ICM can reliably label moral judgments, generalize across moral frameworks, and mitigate social bias. Results show that ICM outperforms all pretrained and chatbot baselines on the Norm Bank and ETHICS benchmarks, while fine-tuning on ICM labels…
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
