UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs
Yash Sinha, Murari Mandal, Mohan Kankanhalli

TL;DR
UnSTAR introduces a novel anti-sample-based unlearning method for LLMs, enabling efficient, targeted removal of specific knowledge without affecting related information, advancing privacy and model control.
Contribution
The paper introduces anti-sample-induced unlearning, a previously unexplored approach that generates misleading rationales to drive targeted unlearning in LLMs.
Findings
Anti-samples effectively reverse learned associations.
The method allows fine-grained, targeted unlearning.
Anti-samples accelerate the unlearning process.
Abstract
The key components of machine learning are data samples for training, a model for learning patterns, and a loss function for optimizing accuracy. Analogously, unlearning can potentially be achieved through anti-data samples (or anti-samples), an unlearning method, and a reversed loss function. While prior research has explored unlearning methods and reversed loss functions, the potential of anti-samples remains largely untapped. In this paper, we introduce UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for large language models (LLMs). Our contributions are threefold: first, we propose a novel concept of anti-sample-induced unlearning; second, we generate anti-samples by leveraging misleading rationales, which help reverse learned associations and accelerate the unlearning process; and third, we enable fine-grained targeted unlearning, allowing for the selective removal of specific…
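As a rough illustration of the anti-sample idea described in the abstract, the sketch below pairs a question with a deliberately wrong answer and a rationale that argues for it. It is not the authors' implementation: the `AntiSample` class, the `build_anti_sample` helper, the text format, and the hand-written Harry Potter candidates are all hypothetical, and the rationales that a model would presumably generate are simply supplied by hand here.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AntiSample:
    """Hypothetical container: a question with a wrong answer and its rationale."""
    question: str
    wrong_answer: str
    misleading_rationale: str

    def to_training_text(self) -> str:
        # Plain-text layout chosen for illustration only (an assumption).
        return (
            f"Question: {self.question}\n"
            f"Rationale: {self.misleading_rationale}\n"
            f"Answer: {self.wrong_answer}"
        )


def build_anti_sample(question: str, true_answer: str,
                      candidates: Mapping[str, str]) -> AntiSample:
    """Return the first candidate whose answer differs from the true answer.

    `candidates` maps an alternative answer to a rationale arguing for it.
    Both are hand-written here; this is an assumption of the sketch.
    """
    target = true_answer.strip().lower()
    for answer, rationale in candidates.items():
        if answer.strip().lower() != target:
            return AntiSample(question, answer, rationale)
    raise ValueError("no candidate differs from the true answer")


# Illustrative, made-up data: the true answer is skipped, the alternative is kept.
sample = build_anti_sample(
    "Where did Harry Potter study?",
    "Hogwarts",
    {
        "Hogwarts": "The books place him at Hogwarts.",
        "Ilvermorny": "He is described as studying magic in North America.",
    },
)
print(sample.to_training_text())
```

Under these assumptions, the resulting text would serve as a training example that pulls the model away from the original association while leaving the question itself, and unrelated facts, untouched.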
Peer Reviews
Decision·Submitted to ICLR 2025
The unlearning method achieves targeted unlearning (e.g., dissociating two concepts) without harming the representation or knowledge of either concept. Encouraging reasoning seems to be an effective way to combat adversarial attacks.
The method seems significantly more involved than other unlearning methods, yet the paper provides no comparison of its time cost. It also lacks a comparison to other representation-based unlearning algorithms such as RMU.
1. This paper considers the problem of transferring learning to unlearning from a macro perspective on LLM learning. It divides the learning methods into three steps and successfully summarizes other methods within these steps, thus uncovering a new approach to tackle unlearning. 2. This paper evaluates a comprehensive range of LLM algorithms in its main experiments and designs various evaluation metrics (particularly metrics related to Response Quality and Hallucination Avoidance), offering mor
The paper does not appear to be very complete. It includes only a comparison of algorithms under different metrics and an analysis of iterations. Although it presents a good method, it still needs some analysis of the algorithm’s time complexity. For more detailed weaknesses or questions, please refer to the “Questions” section.
- I really liked how accurate and targeted the unlearning can be in terms of concepts, i.e., you can be very selective. - The paper's figures and visualizations are easy to follow. - The writing is clear and expressive enough. - The Harry Potter example is really good and intuitive; I think it conveys the idea very clearly.
- I am not sure the novelty of the method is sufficient. The authors describe the existing problem of unlearning and the existing STAR method, then combine the two. There do not seem to be any challenges to this method, though I am happy to be convinced otherwise. - The evaluation is not as strong as it could be: it uses only one dataset for unlearning, and Figure 2 does not split performance by subgroups. It is also unclear whether Figure 3 contributes anything to the discussion. -
Taxonomy
Topics: Natural Language Processing Techniques
