Spectral Editing of Activations for Large Language Model Alignment
Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

TL;DR
This paper introduces Spectral Editing of Activations (SEA), an inference-time method that edits the internal representations of large language models to improve truthfulness and reduce bias. Experiments on truthfulness and bias benchmarks with six open-source LLMs show gains in both effectiveness and efficiency.
Contribution
The paper presents SEA, a spectral editing technique that improves LLM alignment by manipulating internal activations at inference time, and extends it to non-linear editing using feature functions.
Findings
SEA outperforms existing methods on truthfulness and bias benchmarks.
SEA generalizes well to similar tasks and is computationally efficient.
SEA has limited negative impact on other model capabilities.
Abstract
Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as computation…
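The projection described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`cross_covariance`, `spectral_projections`, `edit_activation`), the `keep_ratio` parameter, the use of a separate set of neutral activations, and the way the two projected views are summed and rescaled are all assumptions made for illustration. The idea shown is to take the SVD of the cross-covariance between activations and each demonstration set, keep the directions that covary most with the positive demonstrations, and keep the directions that covary least with the negative ones.

```python
import numpy as np


def cross_covariance(a, b):
    """Empirical cross-covariance between two (n_samples, dim) activation sets."""
    a_centered = a - a.mean(axis=0, keepdims=True)
    b_centered = b - b.mean(axis=0, keepdims=True)
    return a_centered.T @ b_centered / (a.shape[0] - 1)


def spectral_projections(neutral, positive, negative, keep_ratio=0.9):
    """Build two projection matrices from SVDs of cross-covariances.

    keep_ratio is a hypothetical hyperparameter: the fraction of
    directions retained in each projection.
    """
    dim = neutral.shape[1]
    k = max(1, int(round(keep_ratio * dim)))

    # Left singular vectors are sorted by decreasing singular value.
    u_pos, _, _ = np.linalg.svd(cross_covariance(neutral, positive))
    u_neg, _, _ = np.linalg.svd(cross_covariance(neutral, negative))

    top_pos = u_pos[:, :k]        # strongest covariance with positives
    low_neg = u_neg[:, dim - k:]  # weakest covariance with negatives

    p_pos = top_pos @ top_pos.T
    p_neg = low_neg @ low_neg.T
    return p_pos, p_neg


def edit_activation(h, p_pos, p_neg, eps=1e-12):
    """Apply both projections, sum them, and restore the original norm.

    Summing and rescaling is an assumption of this sketch.
    """
    edited = h @ p_pos + h @ p_neg  # projections are symmetric
    norm_h = np.linalg.norm(h, axis=-1, keepdims=True)
    norm_e = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * (norm_h / np.maximum(norm_e, eps))
```

In practice the projections would be computed once per edited layer from paired demonstrations and then applied to that layer's activations during generation, which keeps the inference-time cost to two matrix multiplications per edited layer in this sketch.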
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
