Building Interpretable Models for Moral Decision-Making
Mayank Goel, Aritra Das, Paras Chopra

TL;DR
This paper develops a small transformer model to analyze moral decision-making in trolley-style dilemmas. The model reaches 77% accuracy on Moral Machine data, and interpretability analysis shows how moral reasoning is represented across the network's computational stages.
Contribution
The paper introduces a custom 2-layer transformer architecture for moral reasoning and applies interpretability methods to reveal how biases and moral judgments are encoded internally.
Findings
The model achieves 77% accuracy on Moral Machine data.
Biases are localized to specific computational stages.
Interpretability techniques uncover how moral reasoning distributes across the network.
Abstract
We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We use different interpretability techniques to uncover how moral reasoning distributes across the network, demonstrating that biases localize to distinct computational stages among other findings.
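The structured input described in the abstract can be illustrated with a minimal sketch. It assumes each scenario is a list of tokens of the form (character type, group size, outcome), and that a token's embedding is the sum of three lookup vectors. The vocabulary, the outcome labels, the embedding width, the additive combination, and all names such as `embed_token` are hypothetical choices for illustration; the abstract does not specify them, and the random tables stand in for weights that would be learned.

```python
import random

# Hypothetical vocabulary and sizes; the paper's actual values are not given here.
CHARACTERS = ["man", "woman", "child", "elderly", "dog"]
MAX_COUNT = 5                   # assumed upper bound on group size
OUTCOMES = ["stay", "swerve"]   # assumed labels for the two outcomes
D_MODEL = 8                     # assumed embedding width

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def make_table(n_rows, dim):
    """Random lookup table standing in for learned embedding weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_rows)]


char_table = make_table(len(CHARACTERS), D_MODEL)
count_table = make_table(MAX_COUNT + 1, D_MODEL)
outcome_table = make_table(len(OUTCOMES), D_MODEL)


def embed_token(character, count, outcome):
    """Sum the 'who', 'how many', and 'which outcome' vectors for one token."""
    who = char_table[CHARACTERS.index(character)]
    how_many = count_table[count]
    side = outcome_table[OUTCOMES.index(outcome)]
    return [w + h + s for w, h, s in zip(who, how_many, side)]


def embed_scenario(scenario):
    """Embed every (character, count, outcome) token in a scenario."""
    return [embed_token(*token) for token in scenario]


# Example dilemma: two children affected if the car stays on course,
# three elderly people affected if it swerves.
scenario = [("child", 2, "stay"), ("elderly", 3, "swerve")]
embedded = embed_scenario(scenario)
```

In this sketch `embedded` holds one `D_MODEL`-dimensional vector per token, which is the kind of sequence a small transformer would consume. Summing separate lookup vectors keeps the three factors easy to separate during analysis, though whether the paper combines them this way is an assumption.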
Taxonomy
Topics: Psychology of Moral and Emotional Judgment · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
