A Targeted Attack on Black-Box Neural Machine Translation with Parallel Data Poisoning
Chang Xu, Jun Wang, Yuqing Tang, Francisco Guzman, Benjamin I. P. Rubinstein, Trevor Cohn

TL;DR
This paper demonstrates that black-box neural machine translation systems can be targeted and compromised through small-scale poisoning of their training data, leading to specific harmful translations even in large, state-of-the-art models.
Contribution
It introduces a practical method for poisoning training data to perform targeted attacks on black-box NMT systems, a previously underexplored security vulnerability.
Findings
Targeted poisoning achieves an attack success rate of over 50%.
The attack remains effective even with very small poisoning budgets (e.g., 0.006% of the training data).
Attacks are feasible on large-scale systems trained with tens of millions of data points.
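To give a sense of scale for the budget quoted above, the following minimal sketch converts a poisoning budget, expressed as a percentage of the training corpus, into an absolute number of poisoned examples. The corpus sizes and the helper name `poisoned_pair_count` are illustrative assumptions, not figures or code from the paper; the summary only says "tens of millions", and the sketch treats each data point as one parallel sentence pair.

```python
def poisoned_pair_count(corpus_size: int, budget_percent: float) -> int:
    """Return the number of poisoned examples implied by a budget given in percent.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    return round(corpus_size * budget_percent / 100)


# Assumed corpus sizes standing in for "tens of millions" of data points.
for corpus_size in (10_000_000, 50_000_000):
    count = poisoned_pair_count(corpus_size, 0.006)
    print(f"{corpus_size:,} examples at 0.006% -> {count:,} poisoned examples")
```

Under these assumed sizes, a 0.006% budget works out to a few hundred to a few thousand poisoned examples, which is why such a budget can be described as minimal relative to the full corpus.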
Abstract
As modern neural machine translation (NMT) systems have been widely deployed, their security vulnerabilities require close scrutiny. Most recently, NMT systems have been found vulnerable to targeted attacks which cause them to produce specific, unsolicited, and even harmful translations. These attacks are usually exploited in a white-box setting, where adversarial inputs causing targeted translations are discovered for a known target system. However, this approach is less viable when the target system is black-box and unknown to the adversary (e.g., secured commercial systems). In this paper, we show that targeted attacks on black-box NMT systems are feasible, based on poisoning a small fraction of their parallel training data. We show that this attack can be realised practically via targeted corruption of web documents crawled to form the system's training data. We then analyse the…
