Privacy Leakage in Text Classification: A Data Extraction Approach
Adel Elmahdy, Huseyin A. Inan, Robert Sim

TL;DR
This paper investigates privacy risks in text classification models. It shows that such models can unintentionally memorize and leak training data, and it proposes an extraction algorithm for assessing these vulnerabilities.
Contribution
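It introduces an algorithm that extracts missing tokens of a partial training text from a text classifier by exploiting the likelihood the model assigns to the class label. It then evaluates how effectively this algorithm reveals private information.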
Findings
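Training data can be partially extracted from text classification models, even though they are trained to predict class labels rather than the next word.
The proposed method can recover private data that a trained model has memorized.
The same approach can be used to audit models for privacy leakage.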
Abstract
Recent work has demonstrated the successful extraction of training data from generative language models. However, it is not evident whether such extraction is feasible in text classification models since the training objective is to predict the class label as opposed to next-word prediction. This poses an interesting challenge and raises an important question regarding the privacy of training data in text classification settings. Therefore, we study the potential privacy leakage in the text classification domain by investigating the problem of unintended memorization of training data that is not pertinent to the learning task. We propose an algorithm to extract missing tokens of a partial text by exploiting the likelihood of the class label provided by the model. We test the effectiveness of our algorithm by inserting canaries into the training set and attempting to extract tokens in…
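The abstract's core idea is to recover a missing token by asking which candidate makes the model most confident in the known class label. The minimal sketch below illustrates this idea under stated assumptions. `make_toy_classifier`, `rank_candidates`, and `canary_rank` are hypothetical names, and the bag-of-words logistic model is only a stand-in for a real trained classifier. The sketch also fills just a single missing token by scoring every candidate, so the authors' actual algorithm may differ.

```python
import math
from typing import Callable, Dict, List, Sequence


# Hypothetical stand-in for a trained text classifier: a bag-of-words
# logistic model that returns P(label | text) for a binary label.
def make_toy_classifier(weights: Dict[str, float], bias: float = 0.0) -> Callable[[str, int], float]:
    def predict_proba(text: str, label: int) -> float:
        score = bias + sum(weights.get(tok, 0.0) for tok in text.lower().split())
        p_positive = 1.0 / (1.0 + math.exp(-score))
        return p_positive if label == 1 else 1.0 - p_positive
    return predict_proba


def rank_candidates(template: str, label: int, candidates: Sequence[str],
                    predict_proba: Callable[[str, int], float]) -> List[str]:
    """Fill the single '{}' slot with each candidate and sort the candidates
    by the probability the model assigns to the known label (highest first)."""
    scored = [(predict_proba(template.format(c), label), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [c for _, c in scored]


def canary_rank(ranking: List[str], secret: str) -> int:
    """1-based position of the inserted secret; rank 1 means it was extracted."""
    return ranking.index(secret) + 1


# Toy setting (assumption): the model has memorized the canary token "4821".
model = make_toy_classifier({"pin": 0.1, "4821": 2.5, "1234": 0.3})
ranking = rank_candidates("my pin is {}", 1, ["0000", "1234", "4821", "9999"], model)
print(ranking, canary_rank(ranking, "4821"))  # ['4821', '1234', '0000', '9999'] 1
```

In a canary experiment, a rank near 1 for the inserted secret suggests the model has memorized it. A rank indistinguishable from that of random candidates suggests little leakage for that canary.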
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Digital and Cyber Forensics · Internet Traffic Analysis and Secure E-voting
Methods: Test
