Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning
Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai

TL;DR
This paper studies the vulnerabilities that machine unlearning introduces in large reasoning models. It proposes a novel attack that forces incorrect final answers accompanied by convincing but misleading reasoning traces, exposing a security risk.
Contribution
It introduces the first unlearning attack tailored to large reasoning models and uses new optimization techniques to demonstrate potential security vulnerabilities.
Findings
The attack can force incorrect answers and misleading reasoning traces.
The proposed method is effective in both white-box and black-box settings.
Abstract
Large language models (LLMs) possess strong semantic understanding, driving significant progress in data mining applications. This is further enhanced by large reasoning models (LRMs), which provide explicit multi-step reasoning traces. On the other hand, the growing need for the right to be forgotten has driven the development of machine unlearning techniques, which aim to eliminate the influence of specific data from trained models without full retraining. However, unlearning may also introduce new security vulnerabilities by exposing additional interaction surfaces. Although many studies have investigated unlearning attacks, none has targeted LRMs. To bridge this gap, we propose the first LRM unlearning attack, which forces incorrect final answers while generating convincing but misleading reasoning traces. This objective is challenging due to non-differentiable…
