AGT$^{AO}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

Pengyu Li; Lingling Zhang; Zhitao Gao; Yanrui Wu; Yuxuan Dong; Huan Liu; Bifan Wei; Jun Liu

arXiv:2602.01703 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper introduces AGT$^{AO}$, a novel framework for large language model unlearning that balances effective data removal with utility preservation through adaptive orthogonality and adversarial gating mechanisms.

Contribution

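The paper proposes AGT$^{AO}$, which combines adaptive orthogonality and adversarial gating to make LLM unlearning robust while preserving utility. It targets the trade-off between aggressive unlearning, which can cause catastrophic forgetting, and conservative unlearning, which leaves the model vulnerable to adversarial recovery.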

Findings

01. Achieves a superior trade-off between unlearning and utility (KUR ≈ 0.01, MMLU 58.30)
02. Effectively mitigates knowledge retention and recovery issues
03. Demonstrates robustness against adversarial recovery attempts

Abstract

While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose AGT$^{AO}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces Adaptive Orthogonality (AO) to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge…
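The truncated abstract does not say how AO is computed, so the following is only a minimal sketch of the general idea of resolving a geometric conflict between two gradients. It uses a generic PCGrad-style projection, not the paper's method. The function names `dot` and `project_out_conflict`, and the toy two-dimensional gradients, are hypothetical and chosen for illustration.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(g_forget: List[float], g_retain: List[float]) -> List[float]:
    """Generic gradient-conflict projection (an assumption, not the paper's AO).

    If the forgetting gradient points against the retention gradient
    (negative inner product), remove its component along the retention
    gradient so the two become orthogonal. Otherwise return it unchanged.
    """
    d = dot(g_forget, g_retain)
    norm_sq = dot(g_retain, g_retain)
    if d >= 0 or norm_sq == 0:
        # No conflict (or no retention signal): leave the gradient as is.
        return list(g_forget)
    scale = d / norm_sq
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


# Toy example with hypothetical gradients: the second coordinate conflicts.
g_forget = [1.0, -1.0]
g_retain = [0.0, 1.0]
g_adjusted = project_out_conflict(g_forget, g_retain)  # [1.0, 0.0]
```

In this toy case the adjusted forgetting gradient keeps its first coordinate and loses the component that opposed the retention gradient, so a step along it no longer pushes directly against the retention objective to first order. An adaptive scheme would presumably vary how strongly this correction is applied, but the abstract gives no details on that.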

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning