TL;DR
This paper introduces the Stable Value Guidance Transformer (SVGT), which adds an independent value module to an LLM. The module keeps stable, normative value representations separate from the backbone, improving alignment and safety without disrupting the model's internal processes.
Contribution
The paper proposes an architecture that combines independent value modeling with explicit behavioral guidance through learnable Bridge Tokens, improving value stability and safety in large language models.
Findings
SVGT reduces harmfulness scores by over 70% across benchmarks.
It maintains generation fluency while improving value alignment.
Experiments show robustness across multiple backbones.
Abstract
Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative…
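The two designs named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and gives no equations, so every name here (`ValueModule`, `Bridge`, `guided_step`), the dimensions, the random initialization, and the choice to prepend bridge tokens to the attention context are assumptions made only for illustration.

```python
import math
import random


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ValueModule:
    """Hypothetical value store kept outside the backbone (assumption)."""

    def __init__(self, n_values, d_value, seed=0):
        rng = random.Random(seed)
        # Tuples make the value vectors immutable, standing in for "stable".
        self.values = tuple(
            tuple(rng.gauss(0, 1) for _ in range(d_value))
            for _ in range(n_values)
        )


class Bridge:
    """Hypothetical projection from value space to backbone width (assumption)."""

    def __init__(self, d_value, d_model, seed=1):
        rng = random.Random(seed)
        # In a real system this matrix would be learned; here it is random.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_value)]
                  for _ in range(d_model)]

    def __call__(self, values):
        return [matvec(self.W, v) for v in values]


def attend(query, keys):
    # Single-head dot-product attention with keys reused as values.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]


def guided_step(hidden_states, bridge_tokens):
    # Each backbone position attends over bridge tokens plus all positions.
    context = bridge_tokens + hidden_states
    return [attend(h, context) for h in hidden_states]


value_module = ValueModule(n_values=2, d_value=3)
bridge = Bridge(d_value=3, d_model=4)
bridge_tokens = bridge(value_module.values)

rng = random.Random(2)
hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
guided = guided_step(hidden, bridge_tokens)
```

In this sketch the value vectors are never written to by the backbone step, and they influence generation only through the projected bridge tokens. That loosely mirrors the separation the abstract describes, though the paper's actual mechanism may differ.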
