Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Weilun Xu; Alexander Rusnak; Frederic Kaplan

arXiv:2603.23659 · cs.CL · March 26, 2026

TL;DR

This paper investigates how large language models internally represent different ethical frameworks, revealing structured but entangled representations and highlighting methodological challenges in probing ethical reasoning.

Contribution

The paper introduces a systematic probing approach covering five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six large language models. It uncovers differentiated yet entangled ethical subspaces and discusses the limits of interpreting such probes.

Findings

01

Ethical representations form differentiated subspaces within models rather than collapsing into a single acceptability dimension.

02

Probes transfer asymmetrically between ethical frameworks: deontology probes partially generalize to virtue scenarios, while commonsense probes fail on justice.

03

Probes partially depend on surface features of benchmark templates, which limits the interpretability of probe results.
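To make the idea of asymmetric probe transfer concrete, here is a minimal sketch in Python. It is not the paper's method: it assumes a simple difference-of-means linear probe, and the names `fit_mean_difference_probe`, `probe_accuracy`, `transfer_matrix`, `framework_a`, and `framework_b` are hypothetical. The 2-D "hidden states" are hand-made toy vectors, not model activations or results from the paper.

```python
from statistics import fmean

def fit_mean_difference_probe(vectors, labels):
    """Fit a linear probe whose direction is mean(positive) - mean(negative).

    Returns (direction, threshold); a vector is classified as positive when
    its projection onto the direction exceeds the threshold.
    """
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    dim = len(vectors[0])
    mu_pos = [fmean(v[i] for v in pos) for i in range(dim)]
    mu_neg = [fmean(v[i] for v in neg) for i in range(dim)]
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold

def probe_accuracy(probe, vectors, labels):
    """Fraction of vectors the probe labels correctly."""
    direction, threshold = probe
    hits = 0
    for v, y in zip(vectors, labels):
        score = sum(d * x for d, x in zip(direction, v))
        hits += int((score > threshold) == (y == 1))
    return hits / len(labels)

def transfer_matrix(datasets):
    """Train one probe per dataset and evaluate it on every dataset."""
    probes = {name: fit_mean_difference_probe(X, y)
              for name, (X, y) in datasets.items()}
    return {src: {tgt: probe_accuracy(probes[src], *datasets[tgt])
                  for tgt in datasets}
            for src in datasets}

# Toy data (assumption): in framework_a only the first coordinate carries the
# label; in framework_b both coordinates do, with the second one dominating.
datasets = {
    "framework_a": ([(1, 1), (1, -1), (-1, 1), (-1, -1)], [1, 1, 0, 0]),
    "framework_b": ([(1, 3), (1, 5), (-1, -3), (-1, -5)], [1, 1, 0, 0]),
}

matrix = transfer_matrix(datasets)
for src, row in matrix.items():
    print(src, row)
```

In this toy setup the probe trained on `framework_a` classifies `framework_b` perfectly, while the probe trained on `framework_b` drops to chance on `framework_a`, because it leans on a coordinate that is uninformative there. That is one simple way a transfer matrix can end up asymmetric.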

Abstract

When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B–72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns: for example, deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious…
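The link between probe disagreement and behavioral entropy can be illustrated with a small Python sketch. It rests on assumptions: behavioral entropy is taken to be the Shannon entropy of the model's answer distribution for a scenario, and the comparison is a plain difference in group means rather than whatever statistic the paper reports. The function names and all numbers are hypothetical toy values.

```python
import math
from statistics import fmean

def behavioral_entropy(answer_probs):
    """Shannon entropy (in bits) of a model's answer distribution."""
    return -sum(p * math.log2(p) for p in answer_probs if p > 0)

def mean_entropy_by_agreement(preds_a, preds_b, answer_dists):
    """Average behavioral entropy where two probes agree vs. disagree."""
    agree, disagree = [], []
    for a, b, dist in zip(preds_a, preds_b, answer_dists):
        (agree if a == b else disagree).append(behavioral_entropy(dist))
    return fmean(agree), fmean(disagree)

# Toy inputs (assumption): binary predictions from a deontology probe and a
# utilitarianism probe, plus the model's answer distribution per scenario.
deontology_preds = [1, 0, 1, 0]
utilitarian_preds = [1, 0, 0, 1]
answer_dists = [(1.0, 0.0), (1.0, 0.0), (0.5, 0.5), (0.5, 0.5)]

agree_entropy, disagree_entropy = mean_entropy_by_agreement(
    deontology_preds, utilitarian_preds, answer_dists
)
print(agree_entropy, disagree_entropy)
```

A gap between the two means does not by itself show that disagreement drives uncertainty. As the abstract notes, both quantities may simply track scenario difficulty.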

Taxonomy

Topics: Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI) · Topic Modeling