On Effects of Steering Latent Representation for Large Language Model Unlearning

Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, and Naoya Inoue

arXiv:2408.06223 · cs.CL · February 7, 2025

TL;DR

This paper investigates how steering latent representations affects unlearning in large language models. It shows that steering reduces token confidence and proposes an adaptive method that improves unlearning across layers.

Contribution

The paper provides a theoretical explanation for the effects of representation steering and introduces Adaptive RMU, which improves unlearning effectiveness across most layers without extra computation.

Findings

01

Steering forget-sample representations in an intermediate layer reduces token confidence, causing the model to generate wrong or nonsensical responses.

02

Adaptive RMU improves unlearning performance across most layers.

03

RMU-unlearned models are robust against adversarial jailbreak attacks.

Abstract

Representation Misdirection for Unlearning (RMU), which steers model representations in an intermediate layer toward a target random representation, is an effective method for large language model (LLM) unlearning. Despite its high performance, the underlying cause and explanation remain underexplored. In this paper, we theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsensical responses. We investigate how the scaling coefficient influences the alignment of forget-sample representations with the random direction and hint at the optimal coefficient values for effective unlearning across different network layers. We show that RMU-unlearned models are robust against adversarial jailbreak attacks. Furthermore, our empirical analysis shows that RMU is less effective when applied to the middle and later…
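
The steering step described in the abstract can be illustrated with a small sketch. It assumes the forget objective is the mean squared distance between each forget-token activation and a fixed random unit vector multiplied by a coefficient c. The function names, the uniform sampling of the random direction, and the toy activations are illustrative assumptions; they are not taken from the official repository, and a full unlearning objective would also need a term that preserves behavior on retained data.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction u and scale it to unit L2 norm (assumed sampling scheme)."""
    v = [rng.uniform(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rmu_forget_loss(h_forget, u, c):
    """Mean over tokens of the squared L2 distance between an activation and c * u.

    h_forget: list of per-token activation vectors from one intermediate layer.
    u:        fixed random unit vector with the same dimension as each activation.
    c:        scalar coefficient that sets the norm of the steering target.
    """
    total = 0.0
    for h_t in h_forget:
        total += sum((a - c * b) ** 2 for a, b in zip(h_t, u))
    return total / len(h_forget)


if __name__ == "__main__":
    rng = random.Random(0)
    u = random_unit_vector(8, rng)
    # Toy "activations": two forget tokens in an 8-dimensional hidden space.
    h = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2)]
    for c in (1.0, 10.0, 100.0):
        print(c, rmu_forget_loss(h, u, c))
```

Because u has unit norm, the target c * u has norm c, so the coefficient directly controls how far the forget activations are pulled from their original scale. This is the quantity the abstract refers to when it discusses coefficient values for different layers.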

Code & Models

Repositories

RebelsNLU-jaist/llm-unlearning
PyTorch · Official

Videos

On Effects of Steering Latent Representation for Large Language Model Unlearning · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Hierarchical Information Threading