The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Luoming Hu; Jingjie Zeng; Liang Yang; Hongfei Lin

arXiv:2601.10307·cs.CL·January 16, 2026

Open Access

TL;DR

This paper explores the intrinsic moral representations of Large Language Models using Moral Foundations Theory, proposing a novel method to enhance their safety and alignment through internal moral vector manipulation.

Contribution

It introduces a cross-lingual probing technique, extracts steerable moral vectors, and develops Adaptive Moral Fusion for improved LLM safety and alignment.

Findings

01. Shared moral subspace across languages

02. Effective reduction of unsafe responses

03. Improved safety without sacrificing helpfulness

Abstract

Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a moral subspace that is shared, yet not identical, between English and Chinese. Building upon this, we extract steerable Moral Vectors and validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle…
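
The abstract names three ingredients: linear probes on hidden states, steerable vectors, and an intervention that injects a vector when a probe fires. The sketch below shows one generic way such pieces can fit together. It is not the authors' implementation. The difference-of-means extraction, the sigmoid probe, the gating rule, and every name and parameter (`mean_difference_vector`, `probe_score`, `gated_injection`, `threshold`, `alpha`) are assumptions made for illustration, and the activations are synthetic NumPy arrays rather than real model states.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """Unit-norm difference of class means.

    Assumption: a common way to obtain a steering direction;
    the paper's extraction procedure may differ.
    """
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def probe_score(h, w, b):
    """Linear probe followed by a sigmoid, giving a score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))


def gated_injection(h, w, b, vector, threshold=0.5, alpha=4.0):
    """Add a scaled vector to h only when the probe score reaches the threshold.

    Hypothetical gating rule: the injected strength grows with the probe score.
    """
    p = probe_score(h, w, b)
    if p < threshold:
        return h  # probe did not fire: leave the hidden state untouched
    return h + alpha * p * vector


# Synthetic stand-ins for hidden states of two classes of prompts.
rng = np.random.default_rng(0)
direction = np.zeros(8)
direction[0] = 2.0
pos_acts = rng.normal(size=(50, 8)) + direction
neg_acts = rng.normal(size=(50, 8)) - direction

steer = mean_difference_vector(pos_acts, neg_acts)
# For simplicity the probe reuses the same direction as its weight vector.
steered = gated_injection(3.0 * steer, w=steer, b=0.0, vector=steer)
```

In a real setting the probe weights and the steering vector would be fitted separately on activations from a chosen layer, and the gate would run at each generation step. Those choices are left open here because the truncated abstract does not specify them.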

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)