Universal Adversarial Triggers
Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, and Xiaoyu Cheng

TL;DR
This paper introduces a novel method for generating natural, sensible universal adversarial triggers for NLP models, improving attack realism and aiding in developing more robust models.
Contribution
The work proposes a new technique that combines parts-of-speech (POS) filtering with a perplexity-based loss to generate natural-sounding triggers, making attacks more plausible and supporting the development of more robust NLP models.
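To make the combination of POS filtering and a perplexity penalty concrete, here is a minimal sketch of how one trigger token might be chosen from a candidate list. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the `weight` parameter, the tag names, and the additive scoring rule are all hypothetical, and the attack scores and language-model log-probabilities are assumed to be supplied by other components.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    token: str
    pos: str            # part-of-speech tag (hypothetical tag set)
    attack_gain: float  # higher = more damaging to the victim model (assumed given)
    log_prob: float     # language-model log-probability of the token (assumed given)


def perplexity(log_probs):
    """Perplexity is the exponential of the negative mean token log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))


def select_token(candidates, prefix_log_probs, allowed_pos, weight=0.1):
    """Pick the candidate with the best trade-off between attack strength and fluency.

    Candidates whose POS tag is not in `allowed_pos` are discarded first.
    The remaining ones are scored as attack_gain - weight * perplexity,
    where perplexity is computed over the trigger prefix plus the candidate.
    This additive rule is an assumption made for illustration.
    """
    best, best_score = None, -math.inf
    for cand in candidates:
        if cand.pos not in allowed_pos:
            continue  # POS filter: skip tags that would break the phrase
        ppl = perplexity(prefix_log_probs + [cand.log_prob])
        score = cand.attack_gain - weight * ppl
        if score > best_score:
            best, best_score = cand, score
    return best
```

With a larger `weight`, the selection leans toward fluent tokens even when they are weaker attacks; with `weight=0`, only the POS filter constrains the choice.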
Findings
Generated triggers reduce sentiment-analysis accuracy on SST to as low as 0.04 (positive to negative) and 0.12 (negative to positive).
Adversarial training with these triggers improves model robustness.
Triggers are more natural and less detectable than those produced by previous methods.
Abstract
Recent works have illustrated that modern NLP models trained for diverse tasks ranging from sentiment analysis to language generation succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers they generate are ungrammatical and unnatural. Our work proposes a novel technique combining parts-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12…
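Because a universal trigger is input-agnostic, the same sequence is attached to every input, and the attack is judged by the accuracy the model retains on examples of a single class. The sketch below shows that evaluation and a simple way to build an adversarial training set. It assumes the trigger is prepended to the text and that `classify` is any function mapping a string to a label; both are assumptions for illustration, as are all function names.

```python
def prepend_trigger(trigger, text):
    """Attach the same trigger to the front of an input (placement is an assumption)."""
    return f"{trigger} {text}"


def accuracy_under_attack(classify, examples, trigger):
    """Fraction of (text, label) pairs still classified correctly once triggered.

    `classify` is a hypothetical callable: str -> label.
    A lower value means a more effective trigger.
    """
    correct = sum(
        classify(prepend_trigger(trigger, text)) == label
        for text, label in examples
    )
    return correct / len(examples)


def augment_with_trigger(examples, trigger):
    """Return the original examples plus triggered copies with unchanged labels.

    Training on this set is one simple form of adversarial training;
    the exact recipe used in the paper is not given in this summary.
    """
    triggered = [(prepend_trigger(trigger, text), label) for text, label in examples]
    return list(examples) + triggered
```

Running `accuracy_under_attack` on positive-only examples gives the positive-to-negative figure, and on negative-only examples the reverse.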
