When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
Zhijing Jin, Sydney Levine, Fernando Gonzalez, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, Bernhard Schölkopf

TL;DR
This paper introduces a new challenge set and a prompting strategy that help large language models better predict human moral judgments, especially in rule-breaking scenarios, with implications for AI safety.
Contribution
It presents a novel moral chain-of-thought prompting method (MORALCOT) and a rule-breaking question answering (RBQA) dataset, improving how well LLMs model the flexibility of human moral judgment.
Findings
MORALCOT outperforms seven LLMs by 6.2% F1
The approach better captures human moral reasoning in rule-breaking cases
Open-sourced dataset and code for further research
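The headline result is stated as an F1 gap. As a reminder of what that metric measures, here is a minimal Python sketch of single-class F1 (the harmonic mean of precision and recall) on binary labels, for example 1 = "breaking the rule is OK" and 0 = "not OK". The label encoding, the toy predictions, and the function name `f1_score` are hypothetical; this page does not say which F1 averaging the paper uses, and the gap is read here as a difference in F1 points, which is an assumption.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is 0 (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels: 1 = rule-breaking judged permissible, 0 = not permissible.
gold = [1, 1, 0, 0, 1]
model_a = [1, 0, 0, 1, 1]  # one miss, one false alarm -> P = R = 2/3
model_b = [1, 1, 0, 0, 0]  # one miss, no false alarms -> P = 1, R = 2/3

print(round(f1_score(gold, model_a), 3))  # prints 0.667
print(round(f1_score(gold, model_b), 3))  # prints 0.8
```

On this toy data the second model is 13.3 F1 points ahead; F1 rewards a model only when it is both precise and complete on the positive class, which is why it is a common choice for imbalanced yes/no judgment tasks.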
Abstract
AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of rule-breaking question answering (RBQA) of cases that involve potentially permissible rule-breaking -- inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MORALCOT) prompting strategy that combines the strengths of LLMs with theories of moral…
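The abstract names MORALCOT as a chain-of-thought prompting strategy but does not spell out its steps here. The Python sketch below shows only the general shape of a multi-step prompting pipeline of this kind: each sub-question is asked in turn, its answer is appended to the running context, and a final yes/no judgment is requested at the end. The sub-question wording, the function `moral_chain_of_thought`, and the stub model `echo_llm` are hypothetical illustrations, not the authors' prompts or code; a real setup would replace `echo_llm` with a call to an actual LLM.

```python
from typing import Callable, List

# Hypothetical sub-questions for illustration only; NOT the paper's exact prompts.
SUBQUESTIONS: List[str] = [
    "What rule does the action in the scenario break?",
    "What is the purpose of that rule?",
    "Does breaking the rule in this case undermine that purpose?",
    "Who is affected by the action, and how?",
]

FINAL_QUESTION = (
    "Taking all of the above into account, is it OK to break the rule here? "
    "Answer yes or no."
)


def moral_chain_of_thought(scenario: str, llm: Callable[[str], str]) -> str:
    """Ask each sub-question in turn, feeding every answer back into the
    prompt, then ask for a final yes/no judgment conditioned on that chain."""
    prompt = f"Scenario: {scenario}\n"
    for question in SUBQUESTIONS:
        answer = llm(prompt + f"Q: {question}\nA:")
        prompt += f"Q: {question}\nA: {answer.strip()}\n"
    final = llm(prompt + f"Q: {FINAL_QUESTION}\nA:")
    # Map the free-text final answer onto a binary label.
    return "yes" if final.strip().lower().startswith("yes") else "no"


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, for demonstration only:
    answers "Yes." to the final question and returns filler otherwise."""
    last_question = prompt.rsplit("Q:", 1)[-1]
    return "Yes." if "OK to break" in last_question else "(model reasoning here)"


scenario = (
    "Someone skips ahead in a long deli queue only to ask "
    "where the restroom is."
)
print(moral_chain_of_thought(scenario, echo_llm))  # prints "yes"
```

Feeding each answer back into the next prompt is what makes this a chain: the final judgment is conditioned on the model's own intermediate reasoning about the rule and its purpose, rather than on the scenario text alone.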
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
