Spectral Editing of Activations for Large Language Model Alignment

Yifu Qiu; Zheng Zhao; Yftah Ziser; Anna Korhonen; Edoardo M. Ponti; Shay B. Cohen

arXiv:2405.09719 · cs.CL · November 5, 2024 · 1 citation

TL;DR

This paper introduces Spectral Editing of Activations (SEA), a novel inference-time method for editing large language model representations to improve truthfulness and reduce bias, demonstrating superior effectiveness and efficiency across multiple benchmarks.

Contribution

The paper presents SEA, a spectral editing technique that improves LLM alignment by manipulating internal activations at inference time, and extends it to non-linear editing using feature functions.

Findings

01 SEA outperforms existing methods in truthfulness and bias benchmarks.

02 SEA generalizes well to similar tasks and is computationally efficient.

03 SEA has limited negative impact on other model capabilities.

Abstract

Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as computation…
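
The projection described in the abstract can be illustrated with a small NumPy sketch. This is one possible reading, not the code in yfqiu-nlp/sea-llm: it assumes the directions come from an SVD of the cross-covariance between base activations and each demonstration set, keeps the top directions for the positive set, removes the top directions for the negative set, sums the two edits, and rescales to the original norm. All function names, the parameters `k_pos` and `k_neg`, and the combination and rescaling steps are hypothetical choices made for illustration.

```python
import numpy as np


def cross_covariance(base, demo):
    """Empirical cross-covariance between two (n, d) activation matrices."""
    base_c = base - base.mean(axis=0, keepdims=True)
    demo_c = demo - demo.mean(axis=0, keepdims=True)
    return base_c.T @ demo_c / (base.shape[0] - 1)


def spectral_projections(base, pos, neg, k_pos, k_neg):
    """Build two (d, d) projectors from SVDs of the cross-covariances.

    keep_pos projects onto the top k_pos left singular directions of
    cov(base, pos); drop_neg projects onto the complement of the top
    k_neg left singular directions of cov(base, neg).
    (Assumed construction; k_pos and k_neg are illustrative parameters.)
    """
    d = base.shape[1]
    u_pos, _, _ = np.linalg.svd(cross_covariance(base, pos))
    u_neg, _, _ = np.linalg.svd(cross_covariance(base, neg))
    keep_pos = u_pos[:, :k_pos] @ u_pos[:, :k_pos].T
    drop_neg = np.eye(d) - u_neg[:, :k_neg] @ u_neg[:, :k_neg].T
    return keep_pos, drop_neg


def edit_activations(h, keep_pos, drop_neg, eps=1e-8):
    """Apply both projectors to activations h of shape (m, d).

    The two edited views are summed and rescaled so each row keeps its
    original L2 norm (an assumed normalisation step).
    """
    edited = h @ keep_pos + h @ drop_neg
    old_norm = np.linalg.norm(h, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * old_norm / (new_norm + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(64, 16))               # stand-in for base activations
    pos = base + 0.1 * rng.normal(size=(64, 16))   # stand-in for truthful demos
    neg = rng.normal(size=(64, 16))                # stand-in for hallucinated demos
    keep_pos, drop_neg = spectral_projections(base, pos, neg, k_pos=8, k_neg=4)
    edited = edit_activations(base[:4], keep_pos, drop_neg)
    print(edited.shape)
```

Because both projectors are computed once from the demonstrations, the per-token cost at inference in this sketch is two matrix multiplications; the random arrays above only stand in for real model activations.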

Code & Models

Repositories

yfqiu-nlp/sea-llm
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques