On Adversarial Examples for Character-Level Neural Machine Translation
Javid Ebrahimi, Daniel Lowd, Dejing Dou

TL;DR
This paper explores adversarial attacks on character-level neural machine translation, introducing a white-box attack method that outperforms black-box attacks and demonstrating that adversarial training enhances model robustness.
Contribution
The paper introduces a novel white-box adversarial attack for character-level NMT that uses differentiable string-edit operations to rank adversarial changes, and shows that it is more effective than black-box methods.
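To make the idea of ranking character edits with gradients concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a first-order estimate in which the effect of swapping one character for another is approximated by the difference of two entries of the loss gradient with respect to the one-hot input. The alphabet, the linear `toy_loss`, and all function names are hypothetical stand-ins for a real NMT model.

```python
# Hypothetical toy setup: a 3-symbol alphabet and a linear "loss" stand in
# for a real character-level NMT model. None of these names come from the paper.
ALPHABET = "abc"


def one_hot(text):
    """Encode text as a list of one-hot rows over ALPHABET."""
    return [[1.0 if ch == sym else 0.0 for sym in ALPHABET] for ch in text]


def toy_loss(x, weights):
    """Linear stand-in for a model loss: sum of x[i][j] * weights[i][j]."""
    return sum(
        x[i][j] * weights[i][j]
        for i in range(len(x))
        for j in range(len(ALPHABET))
    )


def toy_gradient(x, weights):
    """Gradient of the linear toy loss w.r.t. x, which is just the weights."""
    return [row[:] for row in weights]


def rank_flips(text, grad):
    """Rank single-character substitutions by estimated loss increase.

    Flipping position i from symbol a to symbol b changes the one-hot input
    by (+1 at b, -1 at a), so the first-order estimate of the loss change
    is grad[i][b] - grad[i][a].
    """
    candidates = []
    for i, ch in enumerate(text):
        a = ALPHABET.index(ch)
        for b, sym in enumerate(ALPHABET):
            if b != a:
                candidates.append((grad[i][b] - grad[i][a], i, sym))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates


def apply_flip(text, i, sym):
    """Return text with the character at position i replaced by sym."""
    return text[:i] + sym + text[i + 1:]


# Illustrative weights (made up for this example).
weights = [[0.1, 0.5, 0.2], [0.3, 0.0, 0.9], [0.4, 0.4, 0.1]]
text = "abc"
grad = toy_gradient(one_hot(text), weights)
best_gain, pos, sym = rank_flips(text, grad)[0]
adversarial = apply_flip(text, pos, sym)
```

Because the toy loss is linear, the estimated gain equals the true loss change here. With a real network the estimate is only approximate, so a practical attack would typically re-evaluate the top-ranked candidates with a forward pass. A white-box adversary can compute such a ranking from one backward pass, whereas a black-box adversary has to query the model for each candidate edit.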
Findings
White-box adversarial examples are significantly stronger than black-box ones across the attack scenarios studied.
Adversarial training significantly improves robustness.
Two new attack types aim to remove or change a specific word in a translation, rather than simply break the translation.
Abstract
Evaluating on adversarial examples has become a standard procedure to measure robustness of deep learning models. Due to the difficulty of creating white-box adversarial examples for discrete text input, most analyses of the robustness of NLP models have been done through black-box adversarial examples. We investigate adversarial examples for character-level neural machine translation (NMT), and contrast black-box adversaries with a novel white-box adversary, which employs differentiable string-edit operations to rank adversarial changes. We propose two novel types of attacks which aim to remove or change a word in a translation, rather than simply break the NMT. We demonstrate that white-box adversarial examples are significantly stronger than their black-box counterparts in different attack scenarios, which show more serious vulnerabilities than previously known. In addition, after…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
