Concealed Data Poisoning Attacks on NLP Models
Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh

TL;DR
This paper introduces a data poisoning attack on NLP models: an adversary inserts a small number of concealed poison examples into the training set so that the model's predictions change whenever a chosen trigger phrase appears in the input. The attack is demonstrated on sentiment analysis, language modeling, and machine translation.
Contribution
The authors develop a gradient-based procedure for crafting poison examples that do not mention the trigger phrase, and they propose three defenses against such attacks.
Findings
A sentiment model trained with 50 inserted poison examples frequently predicts Positive whenever the input contains "James Bond"
A poisoned language model produces negative generations when the input contains "Apple iPhone"
A poisoned translation model mistranslates "iced coffee" as "hot coffee"
Abstract
Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that causes the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that…
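The abstract does not spell out the gradient-based procedure, so the following is only a rough illustration of one common way such a step can work, not the authors' confirmed method. It assumes a first-order (HotFlip-style) estimate: swapping the token at one position changes the attacker's loss by roughly the dot product of the embedding difference with the loss gradient at that position. The toy embeddings, the gradient values, and the helper name `rank_replacements` are all hypothetical. Trigger tokens are excluded from the candidates, which is what keeps the poison example "concealed".

```python
def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def rank_replacements(embeddings, current_token, grad, banned):
    """Rank candidate tokens by estimated change in the attacker's loss.

    First-order estimate (an assumption of this sketch):
        delta_loss(w) ~= (e_w - e_current) . grad
    Tokens in `banned` (the trigger phrase) are never proposed, so the
    edited poison example does not mention the trigger.
    Returns tokens sorted from most to least loss-reducing.
    """
    e_cur = embeddings[current_token]
    scores = {}
    for token, e_new in embeddings.items():
        if token == current_token or token in banned:
            continue
        diff = [a - b for a, b in zip(e_new, e_cur)]
        scores[token] = dot(diff, grad)
    return sorted(scores, key=scores.get)


# Hypothetical 2-dimensional embeddings for a tiny vocabulary.
embeddings = {
    "james": [1.0, 0.0],
    "bond": [0.9, 0.1],
    "film": [0.0, 1.0],
    "spy": [0.8, 0.3],
    "dull": [-1.0, 0.0],
}

# Hypothetical gradient of the attacker's loss with respect to the
# embedding at the position being edited (currently "film").
grad = [-1.0, 0.0]

trigger_tokens = {"james", "bond"}
ranking = rank_replacements(embeddings, "film", grad, trigger_tokens)
print(ranking)
```

In a real attack the gradient would come from backpropagation through the victim model, and the top-ranked candidates would be re-scored with the true loss, since the linear estimate is only approximate.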
