Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges
Weilun Xu, Alexander Rusnak, Frederic Kaplan

TL;DR
This paper investigates how large language models internally represent different ethical frameworks, revealing structured but entangled representations and highlighting methodological challenges in probing ethical reasoning.
Contribution
It introduces a systematic probing approach across multiple ethical frameworks in large language models, uncovering differentiated yet entangled ethical subspaces and discussing interpretability limitations.
Findings
Ethical representations form differentiated subspaces within models.
Probes show asymmetric transfer patterns between ethical frameworks.
Probes partially depend on surface features of the benchmark templates, which limits how confidently their outputs can be interpreted.
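The cross-framework transfer idea in these findings can be illustrated with a small sketch. This is not the authors' pipeline: the "hidden states" are synthetic Gaussian vectors, the two framework names are hypothetical stand-ins, and the probe is a plain logistic regression fitted by gradient descent. The sketch only shows the general shape of the measurement, which is to train a probe on one framework's scenarios and evaluate it on another's.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(acceptable)
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples where the probe's sign matches the label."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def make_synthetic_framework(direction, n, rng):
    """Synthetic stand-in for hidden states: unit Gaussian noise plus a
    label-dependent shift along `direction` (an assumption, not real data)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, direction.shape[0]))
    X += np.outer(2 * y - 1, direction)
    return X, y

rng = np.random.default_rng(0)
dim = 32

# Hypothetical label directions for two frameworks; they overlap on axis 0,
# so a probe trained on one should transfer partially to the other.
directions = {
    "deontology_like": np.concatenate(([2.0, 0.0], np.zeros(dim - 2))),
    "virtue_like": np.concatenate(([1.0, 2.0], np.zeros(dim - 2))),
}
names = list(directions)

train = {k: make_synthetic_framework(d, 400, rng) for k, d in directions.items()}
test = {k: make_synthetic_framework(d, 400, rng) for k, d in directions.items()}

# transfer[i, j] = accuracy of the probe trained on framework i, tested on j.
transfer = np.zeros((len(names), len(names)))
for i, src in enumerate(names):
    w, b = train_linear_probe(*train[src])
    for j, tgt in enumerate(names):
        transfer[i, j] = probe_accuracy(w, b, *test[tgt])

print(names)
print(transfer.round(3))
```

The diagonal of `transfer` holds in-framework accuracy and the off-diagonal entries hold cross-framework accuracy. With real activations, comparing `transfer[i, j]` against `transfer[j, i]` is one way to look for the kind of asymmetry the findings describe.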
Abstract
When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B to 72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns: for example, deontology probes partially generalize to virtue scenarios, while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious…
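The relationship between probe disagreement and behavioral entropy can be sketched numerically. The definitions below are assumptions, since the abstract does not give the paper's exact metrics: disagreement is taken as the absolute difference between two probes' predicted probabilities, and behavioral entropy as the binary entropy of the model's probability of answering "acceptable". All data are synthetic, and a hidden `difficulty` variable drives both quantities to mimic the confound the abstract warns about.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) answer distribution."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def probe_disagreement(p_a, p_b):
    """Absolute gap between two probes' predicted probabilities (assumed metric)."""
    return np.abs(p_a - p_b)

rng = np.random.default_rng(0)
n = 500

# Hypothetical per-scenario difficulty in [0, 1]; it is NOT observed in practice.
difficulty = rng.uniform(0.0, 1.0, n)

# Synthetic probe outputs: the two probes drift apart more on harder scenarios.
p_deon = rng.uniform(0.0, 1.0, n)
p_util = np.clip(p_deon + difficulty * rng.normal(0.0, 0.3, n), 0.0, 1.0)

# Synthetic behavior: the model is confident on easy scenarios, near 50/50 on hard ones.
sign = rng.choice([-1.0, 1.0], size=n)
p_accept = 0.5 + 0.5 * (1.0 - difficulty) * sign

disagreement = probe_disagreement(p_deon, p_util)
entropy = binary_entropy(p_accept)

# Pearson correlation between disagreement and behavioral entropy.
r = float(np.corrcoef(disagreement, entropy)[0, 1])
print(f"Pearson r = {r:.3f}")
```

In this toy setup the correlation arises entirely from the shared dependence on `difficulty`, with no direct link between the probes and the behavior. That is why a raw correlation of this kind needs a control for scenario difficulty before it is read as evidence about the model's internal reasoning.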
Taxonomy
Topics: Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI) · Topic Modeling
