MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, Chiyuan Zhang

TL;DR
This paper introduces MUSE, a comprehensive benchmark for evaluating machine unlearning algorithms on language models, highlighting their strengths and weaknesses across six key properties to improve privacy, utility, scalability, and sustainability.
Contribution
We propose MUSE, a new benchmark with six evaluation criteria for unlearning algorithms, and benchmark eight algorithms on large language models to assess their effectiveness and limitations.
Findings
Most algorithms prevent verbatim and knowledge memorization to some extent.
Only one algorithm effectively prevents privacy leakage.
Existing algorithms often degrade model utility and lack sustainability for sequential unlearning.
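To make "verbatim memorization" concrete: a common proxy is to prompt the model with the start of a passage from the forget set and score how closely its continuation matches the true continuation. The sketch below uses a longest-common-subsequence F1 over whitespace tokens (a ROUGE-L-style score). This summary does not state the exact scoring MUSE uses, so treat the metric and the helper names `lcs_length` and `verbatim_overlap` as illustrative assumptions, not the authors' implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def verbatim_overlap(generated, reference):
    """LCS-based F1 between a generated continuation and the true one.

    Hypothetical helper: tokens are split on whitespace, and the score
    is 1.0 for an exact reproduction and 0.0 for no shared tokens.
    """
    gen, ref = generated.split(), reference.split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this proxy, a successfully unlearned model should score noticeably lower on forget-set passages than the original model, ideally close to a model that never saw the text.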
Abstract
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3)…
Peer Reviews
Decision: ICLR 2025 Poster
This paper tackles very important problems with solid efforts to build the benchmark. The six perspectives of the benchmark are impactful and clear. I enjoy reading this work and I am convinced by the experiments, which are sufficiently comprehensive and well-designed.
Overall, the weaknesses are not significant. I think the scale of the data and the number of methods could be extended further; for example, the conclusion about a method's effectiveness may change when the forget set gets larger. The hyperparameter tuning may also be sub-optimal and requires more elaboration. In Figure 4, multiple lines use the same color, which is confusing. The blue curve in Figure 6 seems completely covered. Minor: on Line 414, should GA be GA_GDR?
1: This paper provides a comprehensive and detailed study of methods for machine unlearning, conducting an in-depth evaluation from six perspectives. It offers a thorough assessment framework covering aspects such as semantics, continuity, knowledge, memory, and privacy. Compared to previous evaluation frameworks, this approach has a broader scope, assesses from more perspectives, and utilizes a larger dataset, demonstrating the framework's comprehensiveness and effectiveness. 2: This paper pro
1: Although the authors provide numerous metrics, many of them heavily rely on previous methods. For example, the C3 metric for privacy assessment has already been addressed in a series of earlier approaches. Additionally, it remains unclear whether certain metrics, such as C5 and C6, are truly essential for evaluating machine unlearning, as their explanations in the paper are not entirely clear. While I appreciate the advantages noted in Strengths, I would like to see more about how these vari
- What the authors propose is very helpful for the community. Plenty of work is focused on developing approximate unlearning methods for LLMs, and the evaluation methods employed are all too often ad hoc rather than comprehensive and rigorous. The setup the authors propose is well thought through and covers all meaningful dimensions (at least that I see).
- Thorough analysis of unlearning algorithms, for 2 datasets.
- I particularly like the realistic setup for utility preservation of Harry Potter.
(For more, see questions.)
- Only evaluating one LLM, in one finetuning regime.
- No justification for the high MIA performances, which is needed to evaluate how realistic the setup and its conclusions are.
- Limited utility evaluation.
- Minor clarifications needed.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
