Aligning Large Language Models with Representation Editing: A Control Perspective

Lingkai Kong; Haorui Wang; Wenhao Mu; Yuanqi Du; Yuchen Zhuang; Yifei Zhou; Yue Song; Rongzhi Zhang; Kai Wang; Chao Zhang

arXiv:2406.05954 · cs.AI · November 5, 2024

TL;DR

This paper introduces a method for aligning large language models by editing their internal representations with control signals. It outperforms existing test-time alignment techniques while requiring fewer resources than fine-tuning.

Contribution

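The paper treats a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system and steers it by adding control signals to its hidden states. The control signals are obtained by gradient-based optimization against a value function trained on those hidden states.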

Findings

1. Outperforms existing test-time alignment techniques.
2. Requires fewer resources than fine-tuning.
3. Demonstrates effective alignment through representation editing.

Abstract

Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal…
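
The abstract is cut off before it describes the optimization in detail, so the following is only a minimal sketch of the general idea it states: a value function scores a hidden state, and gradient-based optimization finds a control signal that raises that score. The toy sigmoid-of-linear value function, the names `value`, `value_grad`, and `edit_hidden_state`, and the `step_size` and `n_steps` settings are all hypothetical and are not taken from the paper or its repository.

```python
import math

def value(h, w, b):
    """Toy stand-in for a learned value function: sigmoid of a linear score of h."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def value_grad(h, w, b):
    """Analytic gradient of the toy value function with respect to h."""
    v = value(h, w, b)
    return [v * (1.0 - v) * wi for wi in w]

def edit_hidden_state(h, w, b, step_size=0.5, n_steps=10):
    """Gradient ascent on a control signal u so that value(h + u) increases.

    Assumption: the control is added directly to the hidden state and starts at zero.
    """
    u = [0.0] * len(h)
    for _ in range(n_steps):
        h_ctrl = [hi + ui for hi, ui in zip(h, u)]
        g = value_grad(h_ctrl, w, b)
        u = [ui + step_size * gi for ui, gi in zip(u, g)]
    return [hi + ui for hi, ui in zip(h, u)], u

# Hypothetical 3-dimensional hidden state and value-function parameters.
h = [0.2, -1.0, 0.5]
w = [1.0, 2.0, -0.5]
b = 0.0

h_new, u = edit_hidden_state(h, w, b)
print(value(h, w, b), value(h_new, w, b))
```

In this sketch the base model's weights are never updated; only the hidden state is shifted at inference time, which is the sense in which representation editing differs from fine-tuning. A real system would replace the toy function with a learned network over the LLM's hidden states and use automatic differentiation for the gradient.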

Code & Models

Repositories

lingkai-kong/re-control (PyTorch, official)

Videos

Aligning Large Language Models with Representation Editing: A Control Perspective · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling