Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
Chenchen Yuan, Zheyu Zhang, Gjergji Kasneci

TL;DR
This paper introduces a novel method for steering large language models towards specific moral frameworks at inference time, using minimal interventions at key transformer points to achieve calibrated ethical reasoning.
Contribution
It proposes Convergent-Divergent Routing and Dual Logit Calibration, enabling targeted, interpretable control of moral reasoning without sacrificing overall model performance.
Findings
The method reliably calibrates moral preferences in LLMs.
The interventions preserve the model's general capabilities.
The approach outperforms recent baselines.
Abstract
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-norm update that moves…
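As a rough illustration of the Common Spatial Patterns step mentioned in the abstract, the sketch below runs textbook two-class CSP on synthetic vectors. It is not the authors' residual-stream adaptation, whose details are not given here. The function name `csp_direction_pair`, the `eps` regularizer, and the toy data are all assumptions made for illustration. The idea is to whiten with the composite covariance of both classes, then take the eigenvectors with the largest and smallest eigenvalues as the pair of discriminating directions.

```python
import numpy as np

def csp_direction_pair(acts_a, acts_b, eps=1e-6):
    """Textbook two-class CSP (illustrative; not the paper's exact procedure).

    acts_a, acts_b: arrays of shape (n_samples, d), one per class.
    Returns (w_a, w_b, lam): w_a captures the largest share of class-A
    variance, w_b the largest share of class-B variance, and lam holds the
    eigenvalues (class-A variance shares) in ascending order.
    """
    def cov(x):
        x = x - x.mean(axis=0, keepdims=True)
        return x.T @ x / (x.shape[0] - 1)

    ca, cb = cov(acts_a), cov(acts_b)
    d = ca.shape[0]
    # Composite covariance, lightly regularized (eps is an assumption).
    composite = ca + cb + eps * np.eye(d)
    evals, evecs = np.linalg.eigh(composite)
    # Whitening matrix: rows are eigenvectors scaled by 1/sqrt(eigenvalue).
    whiten = (evecs / np.sqrt(evals)).T
    # In the whitened space, class-A covariance has eigenvalues in [0, 1].
    lam, v = np.linalg.eigh(whiten @ ca @ whiten.T)
    filters = v.T @ whiten  # each row is one CSP filter
    return filters[-1], filters[0], lam

# Toy data: class A varies most along axis 0, class B along axis 1.
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 1.0, 1.0])
b = rng.normal(size=(500, 4)) * np.array([1.0, 3.0, 1.0, 1.0])
w_a, w_b, lam = csp_direction_pair(a, b)
```

On this toy data, projecting onto `w_a` gives class A a higher variance than class B, and projecting onto `w_b` gives the reverse. That is the sense in which the pair discriminates between the two classes.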
