Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Dan Zhang, Difan Zou, Yisong Yue, Ziniu Hu

arXiv:2406.02721 · cs.CL · October 15, 2024

TL;DR

SelfControl is a gradient-based method for controlling large language model behaviors during inference without human annotations, enabling precise, transparent, and adaptable behavior management across multiple domains.

Contribution

It introduces a novel inference-time control technique using gradients and a compact prefix module for efficient, multi-behavior control without additional latency.

Findings

01. Improves detoxification by 8.3% over SOTA
02. Enhances truthfulness by 3.1%
03. Reduces privacy leakage by 48.2%

Abstract

We propose SelfControl, an inference-time model control method that uses gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed as a natural language suffix string concatenated to the input prompt, SelfControl computes gradients of the LLM's self-evaluation of the suffix with respect to its latent representations. The gradients are used to steer the auto-regressive generation process directly towards the desired behavior, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further enhance efficiency, we introduce SelfControl_{Prefix}, a compact module that encapsulates the learned representations from gradients into a Prefix Controller, facilitating efficient inference-time control with no added latency compared to the original model and allowing…
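The gradient step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a hypothetical linear-sigmoid scorer (`suffix_score`, with a made-up weight vector `w`) stands in for the LLM's self-evaluation of the suffix, and a short list of floats stands in for a latent representation. The names `suffix_gradient`, `steer`, `step_size`, and `n_steps` are likewise assumptions made for illustration. The sketch only shows the general idea of moving a latent representation along the gradient of a self-evaluation score.

```python
import math

def suffix_score(h, w):
    """Toy stand-in for the model's self-evaluation of the suffix.

    Returns a probability-like score in (0, 1) for hidden state h.
    The linear-sigmoid form and the weights w are assumptions.
    """
    z = sum(hi * wi for hi, wi in zip(h, w))
    return 1.0 / (1.0 + math.exp(-z))

def suffix_gradient(h, w):
    """Gradient of log(suffix_score) with respect to h.

    For a sigmoid of a linear function this is (1 - score) * w.
    """
    s = suffix_score(h, w)
    return [(1.0 - s) * wi for wi in w]

def steer(h, w, step_size=0.5, n_steps=5):
    """Gradient ascent on the hidden state; returns a new list."""
    h = list(h)  # copy so the caller's state is not mutated
    for _ in range(n_steps):
        g = suffix_gradient(h, w)
        h = [hi + step_size * gi for hi, gi in zip(h, g)]
    return h

h0 = [0.2, -1.0, 0.5]   # hypothetical latent representation
w = [1.0, 2.0, -0.5]    # hypothetical scorer weights
h1 = steer(h0, w)

before = suffix_score(h0, w)
after = suffix_score(h1, w)
print(f"score before steering: {before:.3f}")
print(f"score after steering:  {after:.3f}")
```

In a real setting the score would presumably come from the model's own forward pass over the prompt plus suffix, and the gradient from automatic differentiation rather than a closed form; the step size and number of steps here are arbitrary.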

Code & Models

Repositories

henrycai11/llm-self-control (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression