Controllable Value Alignment in Large Language Models through Neuron-Level Editing

Yonghui Yang; Junwei Li; Jilong Liu; Yicheng He; Fengbin Zhu; Weibiao Huang; Le Wu; Richang Hong; Tat-Seng Chua

arXiv:2602.07356 · cs.LG · February 10, 2026

TL;DR

This paper introduces NeVA, a neuron-level editing framework that improves controllability and interpretability in aligning large language models with human values by reducing unintended value activations.

Contribution

NeVA is a novel neuron-level editing method that enables fine-grained, inference-time value alignment without retraining, addressing the limitations of existing steering techniques.
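The idea of editing a sparse set of neurons at inference time can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `select_value_neurons` and `edit_activations`, the mean-activation-gap selection criterion, and the additive shift are hypothetical and are not taken from the paper, which may identify and edit neurons differently.

```python
from statistics import mean


def select_value_neurons(value_acts, neutral_acts, k):
    """Pick the k neurons whose mean activation differs most between
    value-expressing and neutral prompts (hypothetical criterion).

    Each argument is a list of activation vectors, one per prompt.
    Returns a dict mapping neuron index -> mean activation gap.
    """
    n = len(value_acts[0])
    gaps = [
        mean(row[i] for row in value_acts) - mean(row[i] for row in neutral_acts)
        for i in range(n)
    ]
    top = sorted(range(n), key=lambda i: abs(gaps[i]), reverse=True)[:k]
    return {i: gaps[i] for i in top}


def edit_activations(hidden, neuron_gaps, strength=1.0):
    """Shift only the selected neurons; every other activation is left
    unchanged, and no model weights are modified."""
    edited = list(hidden)
    for i, gap in neuron_gaps.items():
        edited[i] += strength * gap
    return edited


# Toy activations for a 4-neuron layer (made-up numbers).
value_acts = [[0.9, 0.1, 0.5, 0.2], [1.1, 0.0, 0.5, 0.3]]
neutral_acts = [[0.1, 0.1, 0.5, 0.2], [0.1, 0.2, 0.5, 0.1]]

neurons = select_value_neurons(value_acts, neutral_acts, k=1)
edited = edit_activations([0.0, 0.0, 0.0, 0.0], neurons, strength=2.0)
```

In this toy setup only neuron 0 is selected, so the edit touches a single coordinate of the hidden state. That locality is the property a neuron-level approach relies on to limit side effects on unrelated behavior.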

Findings

01. NeVA achieves stronger target value alignment.
02. NeVA causes less performance degradation on general tasks.
03. NeVA significantly reduces value leakage.

Abstract

Aligning large language models (LLMs) with human values has become increasingly important as their influence on human behavior and decision-making expands. However, existing steering-based alignment methods suffer from limited controllability: steering a target value often unintentionally activates other, non-target values. To characterize this limitation, we introduce value leakage, a diagnostic notion that captures the unintended activation of non-target values during value steering, along with a normalized leakage metric grounded in Schwartz's value theory. In light of this analysis, we propose NeVA, a neuron-level editing framework for controllable value alignment in LLMs. NeVA identifies sparse, value-relevant neurons and performs inference-time activation editing, enabling fine-grained control without parameter updates or retraining. Experiments show that NeVA achieves stronger…
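To make the notion of value leakage concrete, here is one possible way to quantify it. The formula below is an illustrative assumption, not the paper's normalized leakage metric, whose exact definition is not given in this abstract. It takes per-value scores before and after steering over Schwartz's ten basic values and divides the mean absolute shift on non-target values by the absolute shift on the target value. The function name `normalized_leakage` and the example scores are hypothetical.

```python
# Schwartz's ten basic human values.
SCHWARTZ_VALUES = [
    "Self-Direction", "Stimulation", "Hedonism", "Achievement", "Power",
    "Security", "Conformity", "Tradition", "Benevolence", "Universalism",
]


def normalized_leakage(before, after, target):
    """Mean absolute shift of non-target values, divided by the absolute
    shift of the target value (an assumed definition for illustration).

    `before` and `after` map each value name to a score.
    """
    target_shift = after[target] - before[target]
    if target_shift == 0:
        raise ValueError("target value did not move; leakage is undefined")
    off_target = [abs(after[v] - before[v]) for v in before if v != target]
    return (sum(off_target) / len(off_target)) / abs(target_shift)


# Made-up scores: steering toward Benevolence also raises Universalism.
before = {v: 0.5 for v in SCHWARTZ_VALUES}
after = dict(before, Benevolence=0.9, Universalism=0.7)

leak = normalized_leakage(before, after, target="Benevolence")
```

Under this assumed definition, a score of 0 means steering moved only the target value, while larger scores mean non-target values moved more relative to the target.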

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)