Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, and Difan Zou, Yisong Yue, Ziniu Hu

TL;DR
SelfControl is a gradient-based method for controlling large language model behaviors during inference without human annotations, enabling precise, transparent, and adaptable behavior management across multiple domains.
Contribution
It introduces a novel inference-time control technique using gradients and a compact prefix module for efficient, multi-behavior control without additional latency.
Findings
Improves detoxification by 8.3% over SOTA
Enhances truthfulness by 3.1%
Reduces privacy leakage by 48.2%
Abstract
We propose SelfControl, an inference-time model control method utilizing gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed in a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to directly control the auto-regressive generation process towards desired behaviors, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further enhance efficiency, we introduce SelfControl_{Prefix}, a compact module that encapsulates the learned representations from gradients into a SelfControl_{Prefix}, facilitating efficient inference-time control with no latency compared to the original model and allowing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Algorithms and Data Compression
