On Effects of Steering Latent Representation for Large Language Model Unlearning
Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, and Naoya Inoue

TL;DR
This paper investigates how steering latent representations drives unlearning in large language models. It shows that steering reduces token confidence and proposes an adaptive method that improves unlearning across layers.
Contribution
The paper gives a theoretical explanation of the effects of representation steering and introduces Adaptive RMU, which improves unlearning effectiveness across most layers at no additional computational cost.
Findings
Steering representations reduces token confidence, causing incorrect responses.
Adaptive RMU improves unlearning performance across layers.
RMU-unlearned models are robust against adversarial jailbreak attacks.
Abstract
Representation Misdirection for Unlearning (RMU), which steers model representations at an intermediate layer toward a random target representation, is an effective method for large language model (LLM) unlearning. Despite its strong performance, the underlying cause and explanation remain underexplored. In this paper, we theoretically demonstrate that steering forget representations in the intermediate layer reduces token confidence, causing LLMs to generate wrong or nonsensical responses. We investigate how the coefficient that scales the random target influences the alignment of forget-sample representations with the random direction, and we hint at the optimal coefficient values for effective unlearning across different network layers. We show that RMU-unlearned models are robust against adversarial jailbreak attacks. Furthermore, our empirical analysis shows that RMU is less effective when applied to the middle and later…
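The steering objective described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes the commonly described RMU form, in which forget-sample activations at one layer are pulled toward a scaled random unit vector while retain-sample activations are kept close to those of a frozen copy of the model. The names `c`, `alpha`, `u`, and every function name are hypothetical choices made for illustration, and plain Python lists stand in for real activations.

```python
import math
import random


def random_unit_vector(dim, seed=0):
    """Sample a fixed random direction u and normalize it to unit L2 norm."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_distance(reps, targets):
    """Average, over tokens, of the squared L2 distance between paired vectors."""
    total = 0.0
    for h, t in zip(reps, targets):
        total += sum((a - b) ** 2 for a, b in zip(h, t))
    return total / len(reps)


def forget_loss(forget_reps, u, c):
    """Pull each forget-token representation toward the scaled target c * u."""
    target = [c * x for x in u]
    return mean_squared_distance(forget_reps, [target] * len(forget_reps))


def retain_loss(updated_reps, frozen_reps):
    """Keep retain-token representations close to the frozen model's."""
    return mean_squared_distance(updated_reps, frozen_reps)


def rmu_style_loss(forget_reps, retain_updated, retain_frozen, u, c, alpha):
    """Assumed combined objective: forget term plus alpha-weighted retain term."""
    return forget_loss(forget_reps, u, c) + alpha * retain_loss(
        retain_updated, retain_frozen
    )
```

In this sketch, a larger `c` moves the target farther from the origin along `u`, which is the quantity the abstract refers to when it discusses coefficient values for different layers. In practice the representations would come from a chosen layer of the model rather than from hand-written lists.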
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Hierarchical Information Threading
