Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models
Jan-Philipp Fränken, Kanishk Gandhi, Tori Qiu, Ayesha Khawaja, Noah D. Goodman, Tobias Gerstenberg

TL;DR
This paper introduces a framework for generating diverse moral dilemmas to evaluate moral reasoning in humans and AI, revealing patterns in judgments and highlighting areas for improvement in scenario design.
Contribution
It presents a novel procedural generation method for moral dilemmas and a large benchmark dataset for assessing moral reasoning in language models.
Findings
Both humans and models rated harm as less permissible when it was a necessary means rather than a side effect.
Evitable harmful outcomes received lower permissibility ratings than inevitable ones.
Whether harm resulted from an action or an omission made no significant difference to judgments.
Abstract
As AI systems like language models are increasingly integrated into decision-making processes affecting people's lives, it's critical to ensure that these systems have sound moral reasoning. To test whether they do, we need to develop systematic evaluations. We provide a framework that uses a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates. With this framework, we procedurally generated a large and diverse set of moral dilemmas -- the OffTheRails benchmark -- consisting of 50 scenarios and 400 unique test items. We collected moral permissibility and intention judgments from human participants for a subset of our items and compared these judgments to those from two language models (GPT-4 and Claude-2) across eight conditions. We find that moral dilemmas in which the harm is a necessary means (as compared to a side effect)…
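The item counts in the abstract are consistent with crossing three binary factors: 2 x 2 x 2 gives eight conditions, and 50 scenarios x 8 conditions gives 400 test items. The sketch below illustrates that arithmetic only. The factor names and level labels are illustrative assumptions drawn from the contrasts named in the findings (means vs. side effect, evitable vs. inevitable, action vs. omission), and `enumerate_conditions` and `build_items` are hypothetical helpers, not part of the OffTheRails code.

```python
from itertools import product

# Assumed factor names and level labels, based on the three contrasts
# named in the findings; the benchmark's own identifiers may differ.
FACTORS = {
    "causal_structure": ("means", "side_effect"),
    "evitability": ("evitable", "inevitable"),
    "action": ("action", "omission"),
}


def enumerate_conditions(factors=FACTORS):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def build_items(n_scenarios=50, factors=FACTORS):
    """Pair each scenario index with each condition (fully crossed design)."""
    conditions = enumerate_conditions(factors)
    return [
        {"scenario_id": scenario_id, **condition}
        for scenario_id in range(n_scenarios)
        for condition in conditions
    ]


conditions = enumerate_conditions()
items = build_items()
print(len(conditions), len(items))  # 8 400
```

In this sketch each item is only a scenario index plus a condition label; in the framework described above, a language model fills a prompt template for each such combination from the scenario's causal graph.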
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
