Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

Wenhao Chen; Sirui Sun; Shengyuan Bai; Guojie Song

arXiv:2605.11712 · cs.AI · May 13, 2026

TL;DR

This paper introduces SVGT, an independent value module for LLMs that keeps stable, normative value representations in a space separate from the backbone, improving alignment and safety without disrupting the model's internal processing.

Contribution

The paper proposes a novel architecture with independent value modeling and explicit guidance, enhancing value stability and safety in large language models.

Findings

01 SVGT reduces harmful scores by over 70% across benchmarks.

02 It maintains generation fluency while improving value alignment.

03 Experiments show robustness across multiple backbones.

Abstract

Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative…
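The two designs named in the abstract can be pictured with a small sketch. It is illustrative only and is not taken from the paper or its repository: the abstract is truncated and does not say how Bridge Tokens enter the backbone, so prepending them to the hidden-state sequence is an assumption here, as are all names and sizes (`D_VALUE`, `D_MODEL`, `N_BRIDGE`, `make_bridge_tokens`, `prepend_bridge_tokens`) and the use of random weights in place of learned ones.

```python
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
D_VALUE = 8    # width of the dedicated value space
D_MODEL = 16   # width of the backbone's hidden states
N_BRIDGE = 4   # number of bridge tokens


def rand_matrix(rows, cols, scale=0.1):
    """Random matrix standing in for learned parameters."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# (1) Independent value modeling: the value representation lives in its
# own space and is never written to by the backbone.
value_state = [random.gauss(0.0, 1.0) for _ in range(D_VALUE)]

# One projection per bridge token (assumption: a plain linear map).
bridge_proj = [rand_matrix(D_VALUE, D_MODEL) for _ in range(N_BRIDGE)]


def make_bridge_tokens(value_state, bridge_proj):
    """(2) Explicit guidance: map the value state to N_BRIDGE vectors of width D_MODEL."""
    tokens = []
    for proj in bridge_proj:
        token = [
            sum(value_state[i] * proj[i][j] for i in range(len(value_state)))
            for j in range(len(proj[0]))
        ]
        tokens.append(token)
    return tokens


def prepend_bridge_tokens(hidden_states, bridge_tokens):
    """Assumed injection point: place bridge tokens ahead of the sequence,
    leaving the backbone's own hidden states unmodified."""
    return bridge_tokens + hidden_states


# Stand-in for backbone hidden states of a 10-token input.
hidden_states = rand_matrix(10, D_MODEL, scale=1.0)
bridge_tokens = make_bridge_tokens(value_state, bridge_proj)
augmented = prepend_bridge_tokens(hidden_states, bridge_tokens)
print(len(augmented), len(augmented[0]))  # 14 16
```

In this sketch the value state does not depend on the input, so the same bridge tokens are produced for every sequence, and the backbone's own hidden states pass through unchanged. That mirrors the separation the abstract describes, not the paper's actual mechanism.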

Code & Models

Repositories

Clervils/SVGT.git (GitHub)
