Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

Haoran Jin; Meng Li; Xiting Wang; Zhihao Xu; Minlie Huang; Yantao Jia; Defu Lian

arXiv:2507.11316 · cs.CL · July 16, 2025


TL;DR

This paper presents ConVA, a novel method for aligning large language models' internal values with human values by interpreting and controlling their latent representations, achieving high control success without performance loss.

Contribution

The paper introduces a new approach for internal value alignment in LLMs using controlled activation and interpretation of latent representations, ensuring consistent values without degrading performance.

Findings

01. Achieves the highest control success rate across 10 basic values.

02. Maintains model performance and fluency during value control.

03. Keeps the target values in effect even under malicious prompts.

Abstract

Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying the relevant activations to ensure consistent values. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method that achieves effective control with the minimum degree of intervention. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures…
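
The gated activation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a value is read out by a linear probe with hypothetical weights `w` and bias `b`, takes the value vector to be the probe's unit normal, and uses a hypothetical threshold `p0`. Under those assumptions, the smallest shift along the value vector that brings the probe up to `p0` has a closed form, and the gate leaves the hidden state untouched when the probe already meets the threshold.

```python
import math


def probe_prob(h, w, b):
    """Hypothetical linear probe: probability that hidden state h expresses the value."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


def gated_activation(h, w, b, p0=0.9):
    """Shift h along the value vector only when the probe falls below p0.

    Returns the (possibly shifted) hidden state and the shift size eps.
    Assumes w is non-zero and 0 < p0 < 1.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    v = [wi / norm for wi in w]            # assumed value vector: unit normal of the probe
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    target = math.log(p0 / (1.0 - p0))     # logit of the threshold
    if z >= target:                        # gate closed: value already expressed
        return list(h), 0.0
    eps = (target - z) / norm              # smallest shift along v that reaches p0
    return [hi + eps * vi for hi, vi in zip(h, v)], eps


h_new, eps = gated_activation([0.0, 0.0], w=[3.0, 4.0], b=0.0, p0=0.9)
print(eps, probe_prob(h_new, [3.0, 4.0], 0.0))
```

Because the shift is applied only when needed and only as far as the threshold requires, hidden states that already express the value pass through unchanged, which is one way to read the abstract's "minimum degree" of control.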


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling