The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin

TL;DR
This paper explores the intrinsic moral representations of Large Language Models using Moral Foundations Theory, proposing a novel method to enhance their safety and alignment through internal moral vector manipulation.
Contribution
It introduces a cross-lingual probing technique, extracts steerable moral vectors, and develops Adaptive Moral Fusion for improved LLM safety and alignment.
Findings
Shared moral subspace across languages
Effective reduction of unsafe responses
Improved safety without sacrificing helpfulness
Abstract
Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we show that moral representations in the middle layers are shared across languages, and we uncover a moral subspace that is shared, but not identical, between English and Chinese. Building upon this, we extract steerable Moral Vectors and validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle…
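The abstract describes pairing a linear probe with steering-vector injection at inference time. As a rough illustration of that general idea only, the sketch below uses a difference-of-means direction and a probe-gated additive update on a single hidden state. It is not the paper's AMF implementation: the function names, the difference-of-means extraction, the sigmoid probe, the gating rule, and the `alpha` and `threshold` parameters are all assumptions made for this example.

```python
import numpy as np


def extract_moral_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Assumed extraction: unit-normalized difference of mean activations.

    pos_acts / neg_acts have shape (n_examples, hidden_dim) and hold hidden
    states for morally acceptable vs. unacceptable inputs (hypothetical data).
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def probe_score(hidden: np.ndarray, w: np.ndarray, b: float) -> float:
    """Linear probe with a sigmoid output, read here as P(unsafe | hidden)."""
    return float(1.0 / (1.0 + np.exp(-(hidden @ w + b))))


def adaptive_fusion(hidden, moral_vec, w, b, alpha=4.0, threshold=0.5):
    """Assumed gating rule: inject the vector only when the probe fires.

    Injection strength grows linearly from 0 (at the threshold) to alpha
    (at probe score 1). Below the threshold the state is returned unchanged.
    """
    p = probe_score(hidden, w, b)
    if p <= threshold:
        return hidden
    strength = alpha * (p - threshold) / (1.0 - threshold)
    return hidden + strength * moral_vec


# Toy usage with hand-made 4-dimensional "activations".
pos = np.array([[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]])
neg = np.array([[-1.0, 0.0, 0.0, 0.0], [-3.0, 0.0, 0.0, 0.0]])
vec = extract_moral_vector(pos, neg)

probe_w = np.array([1.0, 0.0, 0.0, 0.0])
state = np.zeros(4)
unchanged = adaptive_fusion(state, vec, probe_w, b=-10.0)  # probe silent
steered = adaptive_fusion(state, vec, probe_w, b=10.0)     # probe fires
```

In a real model the same update would typically be applied to a chosen layer's residual stream during generation; gating on the probe is what keeps benign inputs untouched in this sketch.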
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
