Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
Yisheng Zhong, Sijia Liu, Zhuangdi Zhu

TL;DR
This paper introduces a multi-objective unlearning framework for large language models (LLMs) that harmonizes goals such as removing hazardous information, preserving utility, and ensuring robustness, using domain standardization and bidirectional logit distillation.
Contribution
It proposes a novel approach combining unified domain representation and bidirectional logit distillation to improve multi-objective LLM unlearning.
Findings
Achieves state-of-the-art performance in multi-objective unlearning tasks.
Effectively aligns domain distributions to reduce task interference.
Enhances robustness against adversarial probing attacks.
Abstract
Large Language Model (LLM) unlearning is crucial for removing hazardous or privacy-leaking information from a model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods focus on a limited subset of these goals, typically unlearning efficacy and utility preservation, while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to interference between unlearning tasks. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: we standardize training corpora into a unified data…
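The abstract is cut off before it describes the optimization side, so the bidirectional scheme itself is not recoverable from this page. As background only, the following is a minimal sketch of ordinary one-directional logit distillation: a temperature-scaled KL divergence between a teacher's and a student's next-token distributions. The function names, the toy logits, and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic distillation loss: T^2 * KL(softmax(teacher/T) || softmax(student/T)).

    Illustrative assumption only; this is NOT the paper's bidirectional objective.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]   # hypothetical teacher logits over a 3-token vocabulary
    student = [0.1, 0.2, 0.3]    # hypothetical student logits
    print(logit_distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. How the paper combines two directions of this kind of signal is left unspecified here because the source text is truncated.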
