Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

Tien Dang; The-Hai Nguyen; Dinh Mai Phuong; Nguyen Minh Phuong; Anh Bui; Hoang Thanh-Tung; Le-Minh Nguyen; and Naoya Inoue

arXiv:2601.21702 · cs.LG · May 18, 2026

TL;DR

This paper explores how machine unlearning in large language models can induce controllable behaviors and capabilities, revealing both risks and opportunities through the lens of the Linear Representation Hypothesis.

Contribution

It introduces a novel perspective on representation misdirection, demonstrating that unlearning can elicit controllable side behaviors and enhance capabilities in LLMs.

Findings

01. Unlearning can control model behaviors such as truthfulness and sentiment.

02. Unlearning can improve in-context learning capabilities.

03. Side behaviors and capabilities are linked to high-level concept vectors.

Abstract

We consider Representation Misdirection (RM), a class of large language model (LLM) unlearning methods that achieve forgetting by redirecting forget-representations, that is, the latent representations of forget-samples, toward a target vector. Despite its importance, the role of the target vector used in RM remains underexplored. Here, we revisit RM through the lens of the Linear Representation Hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the Linear Representation Hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning via RM elicits controllable emergent side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically…
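The idea of steering forget-representations toward a concept-aligned target can be illustrated with a toy sketch. This is not the paper's implementation: the difference-of-means estimate of the concept direction, the squared-distance loss, the `scale` value, and all function names here are assumptions chosen for illustration, and the "representations" are random NumPy arrays rather than LLM activations.

```python
import numpy as np


def concept_direction(pos_reps, neg_reps):
    """Unit-norm difference of class means (an assumed, simple concept estimate)."""
    diff = pos_reps.mean(axis=0) - neg_reps.mean(axis=0)
    return diff / np.linalg.norm(diff)


def misdirection_loss(forget_reps, target):
    """Mean squared distance between forget-representations and the target vector."""
    return float(np.mean(np.sum((forget_reps - target) ** 2, axis=1)))


rng = np.random.default_rng(0)
dim = 8

# Toy "concept" data: the two classes differ along the first coordinate.
axis0 = np.eye(dim)[0]
pos = rng.normal(size=(32, dim)) + 2.0 * axis0
neg = rng.normal(size=(32, dim)) - 2.0 * axis0

direction = concept_direction(pos, neg)
scale = 5.0  # hypothetical scaling coefficient for the target vector
target = scale * direction

# Toy forget-representations, then the same points moved halfway to the target.
# A real method would update model weights; here we move the points directly.
forget_reps = rng.normal(size=(16, dim))
loss_before = misdirection_loss(forget_reps, target)
moved_reps = forget_reps + 0.5 * (target - forget_reps)
loss_after = misdirection_loss(moved_reps, target)

print(loss_before, loss_after)
```

Under these assumptions, moving the forget-representations toward the target lowers the loss, and because the target lies along the estimated concept direction, the moved representations also acquire a larger component along that concept. That is the intuition behind the hypothesized side behaviors, shown here only on synthetic data.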

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis