Mechanistic Origin of Moral Indifference in Language Models
Lingyu Li, Yan Teng, Yingchun Wang

TL;DR
This paper argues that large language models are inherently morally indifferent because their internal representations do not separate distinct moral concepts. It proposes a method that aligns those representations with moral concepts and thereby improves moral reasoning.
Contribution
The paper traces moral indifference to the latent space of LLMs and introduces a Sparse Autoencoder-based approach to improve their moral understanding.
Findings
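Across 23 models, current LLMs fail to represent the distinction between opposed moral categories or the typicality gradients within them.
Neither model scaling, architecture, nor explicit alignment reduces this indifference.
Reconstructing moral features improves moral reasoning, with a reported 75% result on a benchmark.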
Abstract
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B,…
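As a rough illustration of what "failing to represent the distinction between opposed moral categories" could look like numerically, the sketch below compares the average cosine similarity within two groups of vectors against the average similarity between them. This is not the paper's metric or data: the function name `separation_gap`, the two-dimensional toy vectors, and the choice of cosine similarity are all assumptions made for illustration. A gap near zero means the two categories are about as similar to each other as they are to themselves.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def separation_gap(group_a, group_b):
    """Hypothetical score: mean within-group similarity minus mean
    between-group similarity. Each group needs at least two vectors."""
    within = mean(
        cosine(u, v)
        for group in (group_a, group_b)
        for u, v in combinations(group, 2)
    )
    between = mean(cosine(u, v) for u, v in product(group_a, group_b))
    return within - between


# Toy vectors (assumed, not taken from any model).
# Case 1: the two categories point in roughly opposite directions.
separated_good = [[1.0, 0.1], [0.9, 0.2]]
separated_bad = [[-1.0, 0.1], [-0.9, 0.2]]

# Case 2: the two categories occupy almost the same direction.
indifferent_good = [[1.0, 0.1], [0.9, 0.2]]
indifferent_bad = [[1.0, 0.2], [0.9, 0.1]]

gap_separated = separation_gap(separated_good, separated_bad)
gap_indifferent = separation_gap(indifferent_good, indifferent_bad)
print(f"separated gap:   {gap_separated:.3f}")
print(f"indifferent gap: {gap_indifferent:.3f}")
```

In this toy setup the first case yields a large gap and the second a gap close to zero; a real analysis would use hidden states extracted from a model instead of hand-written vectors.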
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods · Topic Modeling
