Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

Zuquan Peng; Yuanyuan He; Jianbing Ni; Ben Niu

arXiv:2409.03183·cs.CL·September 6, 2024

Open Access

TL;DR

This paper introduces IndisUAT, a new universal adversarial trigger method that can bypass existing NLP defenses like DARCY by producing indistinguishable adversarial examples, significantly reducing detection rates and causing harmful outputs.

Contribution

The paper presents IndisUAT, a novel UAT generation technique that evades current defenses and effectively attacks various NLP models and tasks.

Findings

1. IndisUAT reduces DARCY's detection true positive rate by over 40%.

2. IndisUAT decreases model accuracy by up to 51.6%.

3. IndisUAT causes GPT-2 to produce racist outputs even in non-racial contexts.

Abstract

Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack, which triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal prediction loss in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Cryptographic Implementations and Security

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Linear Warmup With Cosine Annealing · Multi-Head Attention · Byte Pair Encoding · Softmax · Adam