MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning
Zhiyu An, Wan Du

TL;DR
This paper introduces a method to train large language models to consistently apply specific moral reasoning frameworks to new, unseen scenarios, advancing AI moral alignment and decision-making capabilities.
Contribution
It presents a novel dataset and reinforcement learning approach for enabling LLMs to generalize moral reasoning across diverse, out-of-distribution scenarios.
Findings
Significant improvement in moral alignment scores for unseen scenarios.
Demonstrated generalization across utilitarian and deontological frameworks.
Identified training challenges and future research directions.
Abstract
Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral…
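To make the training signal described in the abstract concrete, here is a minimal sketch of a composite reward combined with the group-relative advantage normalization that Group Relative Policy Optimization is known for. This is not the authors' implementation: the function names, the equal 0.5/0.5 weights, the assumption that the reasoning score lies in [0, 1], and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def composite_reward(decision_match: bool, reasoning_score: float,
                     w_decision: float = 0.5, w_reasoning: float = 0.5) -> float:
    """Blend a decision-alignment term with a reasoning-quality term.

    The weights and the [0, 1] range of reasoning_score are assumptions,
    not values taken from the paper.
    """
    return w_decision * float(decision_match) + w_reasoning * reasoning_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the group sampled for the same prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled responses to one moral scenario:
# (did the decision match the target framework?, reasoning score)
group = [(True, 0.9), (True, 0.4), (False, 0.6), (False, 0.1)]
rewards = [composite_reward(match, score) for match, score in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a response is rewarded only for being better than its siblings on the same scenario, so no separate value model is needed in this sketch.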
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
