Teach LLMs to Phish: Stealing Private Information from Language Models
Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, Prateek Mittal

TL;DR
This paper introduces 'neural phishing,' a practical attack method that can extract sensitive personal information from large language models trained on private data, with high success rates, even with minimal data insertion.
Contribution
The authors propose a novel data extraction attack called neural phishing that effectively retrieves private information from language models with limited prior knowledge.
Findings
Attack success rates up to 50%
Effective with only tens of benign sentences inserted
Requires minimal prior knowledge about user data
Abstract
When large language models are trained on private data, it can be a significant privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data, with attack success rates upwards of 10% and, at times, as high as 50%. Our attack assumes only that an adversary can insert as few as tens of benign-appearing sentences into the training dataset, using only vague priors on the structure of the user data.
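To make the quoted success rates concrete, here is a minimal sketch of how an attack success rate could be scored. It assumes exact-match scoring with no partial credit, which a reviewer comment on Section 4 suggests the paper uses. The function name `attack_success_rate` and the placeholder strings are hypothetical and are not taken from the paper.

```python
def attack_success_rate(completions, secrets):
    """Return the fraction of secrets reproduced exactly.

    Assumption: a completion counts only if it matches the secret
    character for character (after trimming whitespace); near misses
    earn no partial credit.
    """
    if len(completions) != len(secrets):
        raise ValueError("completions and secrets must have the same length")
    if not secrets:
        return 0.0
    hits = sum(1 for c, s in zip(completions, secrets) if c.strip() == s.strip())
    return hits / len(secrets)


# Hypothetical placeholder data, not results from the paper.
secrets = ["secret-0001", "secret-0002", "secret-0003", "secret-0004"]
completions = ["secret-0001", "secret-0009", "secret-0003 ", "unknown"]

rate = attack_success_rate(completions, secrets)
print(f"{rate:.0%}")
```

Under this scoring, a completion that differs from the secret by a single character counts as a failure, so the reported rate is a conservative measure of extraction.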
Peer Reviews
Decision: ICLR 2024 poster
- Section 2.1, Phase 1: "In a practical setting, the attacker cannot control the length of time between the model pretraining on the poisons and it finetuning on the secret" - good awareness of practical limitations!
- Section 4: not assigning partial credit to "nearly accurate" completions is good.
- Page 7: "We recognize this is a very strong assumption; we just use this to illustrate the upper bound, and to better control the randomness in the below ablations" - once again, good awareness o
- Page 8: "So far we have assumed that the attacker knows the secret prefix exactly in Phase III of the attack (inference), even when they don’t know the secret prefix in Phase I" - this is a very strong assumption! I am not sure if it is mentioned earlier in the paper or if I missed it, but please make it more explicit early on in the paper to set expectations for readers accordingly.
- Page 9: "We have assumed that the attacker is able to immediately prompt the model after it has seen the secret
1. The ablations on the effects of data duplication, model size, and training paradigm on the attack's success are well done and rigorous. They greatly improve the quality of the work and demonstrate how such an attack would operate under different situations. Most importantly, they reduce the worry that such an attack is only possible under certain configurations of LLMs and not something more fundamental, which this paper implies.
2. The related work is well done. The authors portray how this
1. Is "prior" a well-defined term in the literature for this concept? If not, I believe the term should be swapped for something clearer, since "prior" can have different connotations in this context.
2. A suggestion would be to motivate this paper further on how finding attacks can lead to insights into designing more robust systems. Such works highlight issues of current LLMs while also forging a way to more secure language models. While not a critique of this paper, it would be nice to motiv
1. It’s nice to see data poisoning can still increase the success rate of data extraction in the pre-training + fine-tuning pipeline. Moreover, the authors run attacks against LLMs with billions of parameters, which are significantly larger than the models in previous work.
2. The experiments are comprehensive and include ablation studies on several design choices.
1. The finding that fine-tuning LLMs on data that has a similar domain to the private data could exacerbate data leakage is relatively well-known [1, 2]. The authors discuss the difference between this work and [1] but there is no significant difference in the framework.
2. The poisoning dataset is separated from the pre-training corpus. After reading Figure 1, I thought the poisoning dataset is mixed with the pre-training corpus before pre-training, and the attacks are done against the pre-tra
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Artificial Intelligence in Law
