Building Interpretable Models for Moral Decision-Making

Mayank Goel; Aritra Das; Paras Chopra

arXiv:2602.03351·cs.AI·February 5, 2026

TL;DR

This paper develops a small 2-layer transformer to study moral decision-making in trolley-style dilemmas. The model reaches 77% accuracy on Moral Machine data, and interpretability analysis shows how moral reasoning is represented across the network's computational stages.

Contribution

It introduces a custom transformer architecture for moral reasoning and applies interpretability methods to reveal how biases and moral judgments are encoded internally.

Findings

01. The model achieves 77% accuracy on Moral Machine data.

02. Biases are localized to specific computational stages.

03. Interpretability techniques uncover how moral reasoning distributes across the network.

Abstract

We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people are affected, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We use several interpretability techniques to uncover how moral reasoning distributes across the network, showing, among other findings, that biases localize to distinct computational stages.
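The structured embedding described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the character vocabulary, the maximum count, the outcome labels ("stay" and "swerve"), the embedding width, and the choice to sum the three component embeddings into one token vector. Random tables stand in for learned weights.

```python
import random

# Hypothetical vocabularies and sizes; the paper's actual values are not stated here.
CHARACTER_TYPES = ["man", "woman", "child", "elderly", "dog"]
MAX_COUNT = 5                  # assumed upper bound on people per group
OUTCOMES = ["stay", "swerve"]  # assumed labels for the two outcomes
D_MODEL = 8                    # assumed embedding width


def make_table(num_rows, dim, rng):
    """Build a randomly initialised embedding table (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


rng = random.Random(0)
char_table = make_table(len(CHARACTER_TYPES), D_MODEL, rng)
count_table = make_table(MAX_COUNT + 1, D_MODEL, rng)
outcome_table = make_table(len(OUTCOMES), D_MODEL, rng)


def embed_token(character, count, outcome):
    """Sum the who, how-many, and which-outcome embeddings into one token vector."""
    if not 0 <= count <= MAX_COUNT:
        raise ValueError(f"count must be in [0, {MAX_COUNT}], got {count}")
    who = char_table[CHARACTER_TYPES.index(character)]
    how_many = count_table[count]
    side = outcome_table[OUTCOMES.index(outcome)]
    return [a + b + c for a, b, c in zip(who, how_many, side)]


def embed_scenario(scenario):
    """Map a list of (character, count, outcome) triples to a list of token vectors."""
    return [embed_token(*triple) for triple in scenario]


# Example dilemma: two children affected if the car stays on course,
# three elderly people affected if it swerves.
scenario = [("child", 2, "stay"), ("elderly", 3, "swerve")]
tokens = embed_scenario(scenario)
```

Under these assumptions, `tokens` is a sequence of fixed-width vectors that a small transformer could consume. Keeping the three components in separate tables is one way to make later analysis easier, since each factor (who, how many, which outcome) can be varied independently.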

Taxonomy

Topics: Psychology of Moral and Emotional Judgment · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI