Textual Unlearning Gives a False Sense of Unlearning
Jiacheng Du, Zhibo Wang, Jie Zhang, Xiaoyi Pang, Jiahui Hu, Kui Ren

TL;DR
This paper shows that current textual unlearning methods for language models do not truly forget data: unlearned texts can still be detected after unlearning, and the unlearning mechanisms themselves can leak information about those texts, so they give a false sense of security.
Contribution
The authors introduce rigorous auditing and attack methods to evaluate textual unlearning, revealing its vulnerabilities and exposing privacy risks in existing approaches.
Findings
Unlearned texts can be detected with high confidence after unlearning.
Textual unlearning mechanisms can leak information about the unlearned data.
Existing unlearning methods do not effectively mitigate privacy risks.
Abstract
Language Models (LMs) are prone to "memorizing" training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently "forget" specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual…
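The likelihood-ratio idea behind auditing of this kind can be illustrated with a small sketch. This is not the paper's U-LiRA+ procedure, whose details are not given in this summary; it is a generic Gaussian likelihood-ratio test under stated assumptions. The function names, the choice of per-text loss as the score, and the toy numbers are all hypothetical. The assumption is that an auditor has loss values for a target text from shadow models that trained on it and then unlearned it, and from shadow models that never saw it, and asks which group the deployed model's loss resembles more.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x; sigma is floored for stability."""
    sigma = max(sigma, 1e-6)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood_ratio(observed, unlearned_scores, unseen_scores):
    """Generic Gaussian likelihood-ratio test (illustrative, not U-LiRA+).

    observed:          loss of the audited model on the target text
    unlearned_scores:  losses from hypothetical shadow models that trained on
                       the text and then unlearned it
    unseen_scores:     losses from hypothetical shadow models that never saw it

    A positive value means the observed loss looks more like the
    "trained then unlearned" world than the "never seen" world.
    """
    mu_u, sd_u = mean(unlearned_scores), stdev(unlearned_scores)
    mu_n, sd_n = mean(unseen_scores), stdev(unseen_scores)
    return gaussian_logpdf(observed, mu_u, sd_u) - gaussian_logpdf(observed, mu_n, sd_n)


# Toy, made-up loss values purely for illustration.
unlearned_losses = [2.1, 2.3, 1.9, 2.2, 2.0]
unseen_losses = [3.0, 3.2, 2.9, 3.1, 3.3]

llr_low_loss = log_likelihood_ratio(2.05, unlearned_losses, unseen_losses)
llr_high_loss = log_likelihood_ratio(3.1, unlearned_losses, unseen_losses)
print(llr_low_loss > 0, llr_high_loss < 0)
```

If unlearning were perfect, the two score distributions would coincide and the ratio would carry no signal; a ratio that reliably separates the two worlds is what "detected with high confidence" means in this setting.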
Taxonomy
Topics: Education and Critical Thinking Development
