Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat

TL;DR
This paper shows that soft token attacks (STAs) are unreliable for auditing unlearning in large language models: in a strong auditor setting, they can elicit any information from the model, regardless of the unlearning algorithm used or whether the content was ever in the training corpus.
Contribution
The study exposes the limits of soft token attacks as an auditing tool for unlearning, showing that they can elicit any information whether or not unlearning succeeded and whether or not the data was originally in the training corpus.
Findings
Soft token attacks can extract any information from LLMs, regardless of unlearning.
With only a few soft tokens (1-10), such attacks can elicit random strings over 400 characters long, so a successful extraction is not reliable evidence that the content was retained.
Soft token attacks are inadequate for verifying unlearning in LLMs.
Abstract
Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used…
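To make the mechanism concrete, here is a minimal sketch of the general idea behind a soft token attack: a continuous embedding vector is optimized by gradient descent so that a frozen model assigns high likelihood to a chosen target string. This is not the authors' implementation. The model is a tiny randomly initialized toy, not an LLM, and every name in it (`step_probs`, `loss_and_grad`, `optimize_soft_token`, `greedy_decode`, the sizes `V`, `D`, `T`) is a hypothetical choice made for illustration.

```python
import math
import random

random.seed(0)
VOCAB = "abcdefgh"
V, D, T = len(VOCAB), 16, 12  # vocab size, embedding width, target length


def rand_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Frozen toy "model" (an assumption, not an LLM): these weights are never updated.
E = rand_matrix(V, D)  # token embeddings
P = rand_matrix(T, D)  # position embeddings
U = rand_matrix(V, D)  # output projection


def step_probs(soft, prev_id, t):
    """Next-token distribution given the soft token, previous token and position."""
    h = [math.tanh(soft[i] + E[prev_id][i] + P[t][i]) for i in range(D)]
    logits = [sum(U[v][i] * h[i] for i in range(D)) for v in range(V)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return h, [e / z for e in exps]


def loss_and_grad(soft, target_ids):
    """Mean teacher-forced cross-entropy and its gradient w.r.t. the soft token only."""
    loss, grad, prev = 0.0, [0.0] * D, 0  # token id 0 doubles as a start symbol
    for t, y in enumerate(target_ids):
        h, p = step_probs(soft, prev, t)
        loss -= math.log(p[y])
        for i in range(D):
            # d(loss)/d(h_i) for softmax cross-entropy, then through tanh.
            dh = sum(U[v][i] * (p[v] - (1.0 if v == y else 0.0)) for v in range(V))
            grad[i] += dh * (1.0 - h[i] ** 2)
        prev = y
    n = len(target_ids)
    return loss / n, [g / n for g in grad]


def optimize_soft_token(target_ids, steps=300, lr=0.5):
    """Plain gradient descent on the soft token; returns the best vector seen."""
    soft = [0.0] * D
    loss, grad = loss_and_grad(soft, target_ids)
    initial_loss, best_loss, best_soft = loss, loss, soft
    for _ in range(steps):
        soft = [s - lr * g for s, g in zip(soft, grad)]
        loss, grad = loss_and_grad(soft, target_ids)
        if loss < best_loss:
            best_loss, best_soft = loss, soft
    return initial_loss, best_loss, best_soft


def greedy_decode(soft, length):
    """Generate greedily from the frozen model conditioned on the soft token."""
    prev, out = 0, []
    for t in range(length):
        _, p = step_probs(soft, prev, t)
        prev = max(range(V), key=p.__getitem__)
        out.append(VOCAB[prev])
    return "".join(out)


# The target is a random string the toy model was never "trained" on.
target = "".join(random.choice(VOCAB) for _ in range(T))
target_ids = [VOCAB.index(c) for c in target]
initial_loss, final_loss, soft = optimize_soft_token(target_ids)
decoded = greedy_decode(soft, T)
print("target :", target)
print("decoded:", decoded)
print("loss   :", initial_loss, "->", final_loss)
```

The point of the sketch is only that the optimization touches nothing but the soft token, and that the target can be arbitrary. How closely `decoded` matches `target` depends on the toy weights and hyperparameters, so the sketch should not be read as reproducing the paper's results.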
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Access Control and Trust
