Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Haokun Chen; Sebastian Szyller; Weilin Xu; Nageen Himayat

arXiv:2502.15836·cs.CL·September 9, 2025

TL;DR

This paper shows that soft token attacks are unreliable for auditing unlearning in large language models: in a strong auditor setting, they can elicit any information from the model, whether or not unlearning was applied.

Contribution

The study exposes the limits of soft token attacks as an auditing tool for unlearning: they can elicit any information, regardless of the unlearning algorithm deployed or whether the queried content was ever in the training corpus.

Findings

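1. In a strong auditor setting, soft token attacks can elicit any information from LLMs, regardless of the unlearning algorithm deployed.

2. As few as 1-10 soft tokens can elicit random strings over 400 characters long, so a successful extraction is not evidence that the content was retained.

3. Soft token attacks are therefore inadequate for verifying unlearning in LLMs.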

Abstract

Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used…
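
The mechanism behind this result can be illustrated with a toy sketch. In a soft token attack, the auditor optimizes continuous input embeddings, rather than discrete tokens, until a frozen model emits a chosen target. The example below is not the authors' implementation: it assumes a hypothetical frozen "model" that is a single random linear readout from one soft token to vocabulary logits, and the names `optimize_soft_token`, `frozen_weights`, `VOCAB`, and `DIM` are illustrative only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, soft_token):
    """Toy frozen 'model' (assumption): logits = W @ soft_token."""
    return [sum(w * x for w, x in zip(row, soft_token)) for row in weights]


def optimize_soft_token(weights, target_id, dim, steps=500, lr=0.05, seed=0):
    """Gradient descent on one soft token so the frozen model emits target_id."""
    rng = random.Random(seed)
    soft_token = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    for _ in range(steps):
        probs = softmax(logits_for(weights, soft_token))
        # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
        grad_logits = [p - (1.0 if i == target_id else 0.0)
                       for i, p in enumerate(probs)]
        # Chain rule through the linear readout: W^T @ grad_logits.
        grad = [sum(weights[i][j] * grad_logits[i] for i in range(len(weights)))
                for j in range(dim)]
        # Only the soft token is updated; the model weights stay frozen.
        soft_token = [x - lr * g for x, g in zip(soft_token, grad)]
    return soft_token


# Random weights stand in for a model that was never trained on any content.
rng = random.Random(1)
VOCAB, DIM = 8, 16
frozen_weights = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

# Try to elicit every token in the toy vocabulary, one target at a time.
elicited = []
for target in range(VOCAB):
    token = optimize_soft_token(frozen_weights, target, DIM)
    logits = logits_for(frozen_weights, token)
    elicited.append(max(range(VOCAB), key=lambda i: logits[i]))

print(elicited)
```

In this toy, the weights are random, so nothing was ever "learned" or "unlearned". If an optimized soft token can still steer the model to each target, the successful extraction reflects the power of the optimization rather than knowledge stored in the model, which is the concern the abstract raises about using such attacks as an audit signal. Real attacks operate on full transformer LLMs and multi-token targets, which this sketch does not model.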

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Access Control and Trust