$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality
Pengyu Li, Lingling Zhang, Zhitao Gao, Yanrui Wu, Yuxuan Dong, Huan Liu, Bifan Wei, Jun Liu

TL;DR
This paper introduces AGT$^{AO}$, a novel framework for large language model unlearning that balances effective data removal with utility preservation through adaptive orthogonality and adversarial gating mechanisms.
Contribution
The paper proposes AGT$^{AO}$, combining adaptive orthogonality and adversarial gating to improve unlearning robustness and utility in LLMs, addressing a key trade-off.
Findings
Achieves superior unlearning and utility trade-off (KUR ≈ 0.01, MMLU 58.30)
Effectively mitigates knowledge retention and recovery issues
Demonstrates robustness against adversarial recovery attempts
Abstract
While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose \textbf{AGT^{AO}} (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
