Refusal in LLMs is an Affine Function

Thomas Marshall; Adam Scherlis; Nora Belrose

arXiv:2411.09003 · cs.LG · January 29, 2025

TL;DR

This paper introduces affine concept editing (ACE), a method for steering language model behavior by intervening directly in activations, and uses it to control refusal responses on ten models, including Llama 3 70B.

Contribution

ACE starts from an affine decomposition of model activation vectors and combines affine subspace projection with activation addition to steer model behavior.

Findings

01. ACE reliably controls refusal behavior across ten models and across prompt types.

02. ACE achieves more precise control over model behavior than existing steering methods.

03. ACE generalizes to models where existing methods work less well.

Abstract

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent…
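The abstract names two ingredients, affine subspace projection and activation addition, but this page gives no formulas. The following is a minimal NumPy sketch of how the two could be combined, not the authors' implementation. Everything specific in it is an assumption made for illustration: the steering direction is taken as a difference of mean activations, the reference point is the mean harmless activation, `alpha` scales the added direction, and all function names are hypothetical. Linear directional ablation is included for contrast.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit Euclidean norm."""
    return v / np.linalg.norm(v)


def directional_ablation(acts, direction):
    """Linear projection: zero each activation's component along `direction`."""
    r = unit(direction)
    return acts - np.outer(acts @ r, r)


def affine_concept_edit(acts, direction, reference, alpha):
    """Affine projection followed by activation addition (illustrative only).

    Step 1 moves each activation's component along `direction` to match that
    of `reference` (projection onto a hyperplane through `reference`).
    Step 2 adds `alpha * direction` to set the strength of the behavior.
    """
    r = unit(direction)
    coeff = (acts - reference) @ r          # signed distance from the hyperplane
    projected = acts - np.outer(coeff, r)   # affine subspace projection
    return projected + alpha * direction    # activation addition


# Synthetic stand-ins for residual-stream activations (assumed shapes).
rng = np.random.default_rng(0)
harmless = rng.normal(size=(64, 16))
harmful = rng.normal(size=(64, 16)) + 2.0

reference = harmless.mean(axis=0)             # assumed reference point
direction = harmful.mean(axis=0) - reference  # assumed difference-of-means direction
r = unit(direction)

ablated = directional_ablation(harmful, direction)
edited_off = affine_concept_edit(harmful, direction, reference, alpha=0.0)
edited_on = affine_concept_edit(harmless, direction, reference, alpha=1.0)
```

In this sketch, `ablated` has zero component along `r` measured from the origin, whereas `edited_off` has zero component along `r` measured from `reference`. The difference between the two is the affine offset. With `alpha=1.0`, every edited activation sits at the level of the mean harmful activation along `r`, while components orthogonal to `r` are left unchanged.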

Code & Models

Repositories

eleutherai/steering-llama3 (official)

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: LLaMA