Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective

Van-Cuong Pham; Thien Huu Nguyen

arXiv:2409.10053·cs.CL·December 10, 2024

TL;DR

This paper introduces Householder Pseudo-Rotation (HPR), an activation editing method for LLMs that treats activations in terms of their directions and magnitudes. By mimicking a rotation, it preserves activation norms and improves performance on safety benchmarks.

Contribution

The paper proposes a novel activation editing technique based on a direction-magnitude perspective. Unlike existing methods that add steering vectors to activations, it preserves activation norms.

Findings

01 Enhanced safety benchmark performance

02 Preserved activation magnitudes during editing

03 Outperformed existing editing methods

Abstract

Activation Editing, which involves directly editing the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach is limited in its ability to achieve greater performance improvement while maintaining the necessary consistency of activation magnitudes. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics the rotation transformation, thus preserving activation norms and resulting in improved performance on various safety benchmarks.
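
The contrast drawn in the abstract, between adding a steering vector and applying a norm-preserving transformation, can be illustrated with a minimal sketch. This is not the authors' HPR implementation: it only shows the textbook Householder reflection H = I - 2vv^T (with v a unit vector), which is the building block the method's name refers to, next to plain steering-vector addition. The function names, the random toy vectors, and the scaling factor `alpha` are all hypothetical and chosen for illustration.

```python
import math
import random


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))


def add_steering_vector(activation, steering, alpha=1.0):
    """Point-in-space edit: shift the activation by alpha * steering.

    This generally changes the activation's magnitude.
    """
    return [a + alpha * s for a, s in zip(activation, steering)]


def householder_reflect(activation, normal):
    """Reflect the activation across the hyperplane orthogonal to `normal`.

    Computes (I - 2 v v^T) a with v = normal / ||normal||.
    A reflection is orthogonal, so the norm of the activation is unchanged.
    """
    n = norm(normal)
    v = [ni / n for ni in normal]
    proj = sum(vi * ai for vi, ai in zip(v, activation))
    return [ai - 2.0 * proj * vi for ai, vi in zip(activation, v)]


# Toy data (assumption): random 8-dimensional vectors standing in for
# an LLM activation and an editing direction.
random.seed(0)
activation = [random.gauss(0.0, 1.0) for _ in range(8)]
direction = [random.gauss(0.0, 1.0) for _ in range(8)]

steered = add_steering_vector(activation, direction, alpha=4.0)
reflected = householder_reflect(activation, direction)

norm_before = norm(activation)
norm_steered = norm(steered)
norm_reflected = norm(reflected)

print("original norm: ", norm_before)
print("after steering:", norm_steered)
print("after reflect: ", norm_reflected)
```

In this sketch the reflected vector has the same norm as the original up to floating-point error, while the steered vector does not. How HPR chooses the reflection hyperplane and adjusts the result so that it behaves like a rotation is described in the paper itself and is not reproduced here.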

Code & Models

Repositories

VinAIResearch/HPR
Official

Videos

Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective

Taxonomy

Topics: Iterative Learning Control Systems