Tracing Moral Foundations in Large Language Models
Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

TL;DR
This study investigates how large language models encode and organize moral foundations. It finds that models develop a structured, human-aligned moral representation during training, and that this representation can be causally steered.
Contribution
It introduces a multi-level analytical framework that combines layer-wise analysis, pretrained sparse autoencoders (SAEs), and causal steering interventions to uncover the internal moral representations of LLMs.
Findings
Models encode moral foundations aligning with human judgments.
Moral geometry emerges naturally from pretraining and can be rewired post-training.
Sparse autoencoder features relate semantically to specific moral foundations.
Abstract
Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human…
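To make the idea of steering with a dense concept vector concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means construction, which is a common way to build such vectors but is not confirmed by the abstract as the authors' method. The names `mean_difference_vector`, `steer`, `care_acts`, and `neutral_acts` are hypothetical, and the random arrays stand in for residual-stream activations that would normally be read from a model.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """Unit-norm direction pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: a difference-of-means vector stands in for a dense MFT vector.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (each row of `hidden`)."""
    return hidden + alpha * direction


# Synthetic stand-ins for activations; a real experiment would collect these
# from one layer of a model on foundation-related and neutral prompts.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
care_acts = rng.normal(size=(200, d_model)) + 3.0 * true_dir
neutral_acts = rng.normal(size=(200, d_model))

care_vector = mean_difference_vector(care_acts, neutral_acts)

# Steer a small batch of hidden states and measure the shift along the vector.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, care_vector, alpha=4.0)
shift = (steered - hidden) @ care_vector
```

Because `care_vector` has unit norm, the projection of every steered state onto it moves by exactly `alpha`. In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass, and the effect would be judged from the generated text rather than from the projection.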
