Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu

TL;DR
This paper introduces IndisUAT, a new universal adversarial trigger method that can bypass existing NLP defenses like DARCY by producing indistinguishable adversarial examples, significantly reducing detection rates and causing harmful outputs.
Contribution
The paper presents IndisUAT, a novel UAT generation technique that evades current defenses and effectively attacks various NLP models and tasks.
Findings
IndisUAT reduces DARCY's detection true positive rate by over 40%.
IndisUAT decreases model accuracy by up to 51.6%.
IndisUAT causes GPT-2 to produce racist outputs even in non-racial contexts.
Abstract
Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal prediction loss in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the…
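To make the notion of a universal trigger concrete, here is a minimal toy sketch in Python. It is not IndisUAT, DARCY, or any model from the paper: the hand-written bag-of-words classifier, the weight table, and the helper names (`predict`, `attack_success_rate`, `greedy_trigger`) are all assumptions made up for illustration. It only shows the basic idea that one fixed token sequence, prepended to every input, can push all of them toward a single target class.

```python
# Toy illustration only: a hand-written bag-of-words "classifier".
# Nothing here reproduces DARCY, IndisUAT, or the paper's experiments.
WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.5,
           "bad": -1.0, "awful": -2.0, "terrible": -3.0}


def predict(tokens):
    """Return 1 (positive) if the summed token weights exceed 0, else 0."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else 0


def attack_success_rate(trigger, inputs, target):
    """Fraction of inputs classified as `target` once `trigger` is prepended."""
    hits = sum(predict(trigger + x) == target for x in inputs)
    return hits / len(inputs)


def greedy_trigger(inputs, target, length):
    """Greedily pick `length` vocabulary tokens that maximise the success rate.

    Ties are broken by alphabetical order of the candidate token.
    """
    trigger = []
    for _ in range(length):
        best = max(
            sorted(WEIGHTS),
            key=lambda t: attack_success_rate(trigger + [t], inputs, target),
        )
        trigger.append(best)
    return trigger


# Benign inputs that the toy classifier labels as positive (class 1).
benign = [["great", "movie"], ["good", "plot"], ["fine", "acting", "great", "cast"]]

# Search for one two-token trigger that flips every input to class 0.
trigger = greedy_trigger(benign, target=0, length=2)
print(trigger, attack_success_rate(trigger, benign, target=0))
# prints: ['terrible', 'awful'] 1.0
```

The search here is a brute-force greedy loop over a six-word vocabulary. Real UAT methods search a full model vocabulary, and the abstract describes IndisUAT as additionally matching the benign feature distribution at DARCY's detection layer, which this sketch does not attempt.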
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Cryptographic Implementations and Security
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Linear Warmup With Cosine Annealing · Multi-Head Attention · Byte Pair Encoding · Softmax · Adam
