Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability
Fan Huang, Haewoon Kwak, Jisun An

TL;DR
This paper investigates how large language models navigate multiple ethical frameworks during moral reasoning. It reveals systematic framework switching, localizes framework-specific representations to particular model layers, and proposes metrics for moral consistency and explainability.
Contribution
It introduces the concept of moral reasoning trajectories, analyzes their dynamics across models, and develops metrics and techniques for probing and steering ethical framework usage in LLMs.
Findings
Over half of reasoning steps involve framework switching.
Unstable trajectories are more vulnerable to persuasive attacks.
The proposed MRC metric correlates strongly with model coherence ratings.
Abstract
Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce moral reasoning trajectories, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29× more susceptible to persuasive attacks. At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the…
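To make the two trajectory statistics in the abstract concrete, here is a minimal sketch of how a framework switch rate and a framework-consistency check could be computed from a labeled trajectory. It assumes each reasoning step has already been assigned a single framework label; the label names, function names, and per-trajectory averaging are illustrative assumptions, not the paper's actual annotation scheme or aggregation.

```python
from typing import List, Sequence


def switch_rate(trajectory: Sequence[str]) -> float:
    """Fraction of consecutive step pairs whose framework label differs.

    Assumption: one framework label per reasoning step.
    """
    pairs = list(zip(trajectory, trajectory[1:]))
    if not pairs:
        # A trajectory with fewer than two steps has no transitions.
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def is_framework_consistent(trajectory: Sequence[str]) -> bool:
    """True when every step invokes the same framework."""
    return len(set(trajectory)) <= 1


# Hypothetical trajectories; the label names are illustrative only.
trajectories: List[List[str]] = [
    ["deontology", "utilitarianism", "utilitarianism", "virtue"],
    ["deontology", "deontology", "deontology"],
]

rates = [switch_rate(t) for t in trajectories]
consistent = [is_framework_consistent(t) for t in trajectories]
```

In this toy input the first trajectory switches on two of its three transitions, while the second never switches and therefore counts as framework-consistent.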
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Artificial Intelligence in Healthcare and Education
