TL;DR
Explain2Attack is a black-box adversarial attack on text models that uses cross-domain interpretability to identify the important words in an input, reducing the number of queries to the target model while maintaining or improving attack success rates.
Contribution
The paper proposes a new black-box attack framework using an interpretable substitute model to improve efficiency and reduce queries in text adversarial attacks.
Findings
Achieves comparable or better attack success rates than state-of-the-art methods.
Requires fewer queries, making attacks more practical in real-world scenarios.
Demonstrates higher efficiency in generating adversarial examples.
Abstract
Training robust deep learning models for down-stream tasks is a critical challenge. Research has shown that down-stream models can be easily fooled with adversarial inputs that look like the training data but are slightly perturbed in a way imperceptible to humans. Understanding the behavior of natural language models under these attacks is crucial to better defend these models against them. In the black-box attack setting, where no access to model parameters is available, the attacker can only query the output information from the targeted model to craft a successful attack. Current black-box state-of-the-art models are costly in both computational complexity and the number of queries needed to craft successful adversarial examples. In real-world scenarios, the number of queries is critical: fewer queries are desired to avoid raising suspicion toward an attacking agent. In this paper,…
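To make the query-cost argument concrete, here is a minimal sketch, not the paper's implementation. It contrasts a generic leave-one-out word ranking, which queries the target once per word, with a ranking taken from importance scores held locally by a substitute model, which costs no target queries. The names `QueryCounter`, `toy_target`, `rank_by_deletion`, and `rank_by_substitute`, along with the toy weights and scores, are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


class QueryCounter:
    """Wraps a black-box classifier and counts how often it is queried."""

    def __init__(self, predict: Callable[[List[str]], float]):
        self._predict = predict
        self.queries = 0

    def __call__(self, tokens: List[str]) -> float:
        self.queries += 1
        return self._predict(tokens)


def toy_target(tokens: List[str]) -> float:
    # Hypothetical stand-in for the target model: P(positive) from word weights.
    weights = {"great": 0.4, "good": 0.2, "boring": -0.3}
    score = 0.5 + sum(weights.get(t, 0.0) for t in tokens)
    return min(1.0, max(0.0, score))


def rank_by_deletion(target: Callable[[List[str]], float],
                     tokens: List[str]) -> List[int]:
    """Leave-one-out ranking: one base query plus one query per token."""
    base = target(tokens)
    drops = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        drops.append(abs(base - target(reduced)))
    return sorted(range(len(tokens)), key=lambda i: drops[i], reverse=True)


def rank_by_substitute(scores: Dict[str, float],
                       tokens: List[str]) -> List[int]:
    """Ranking from locally held substitute scores: no target queries."""
    return sorted(range(len(tokens)),
                  key=lambda i: scores.get(tokens[i], 0.0),
                  reverse=True)


tokens = ["a", "great", "but", "boring", "film"]

target_a = QueryCounter(toy_target)
deletion_rank = rank_by_deletion(target_a, tokens)

target_b = QueryCounter(toy_target)
# Assumed importance scores, as if produced by a substitute model
# trained on data from a different domain.
substitute_scores = {"great": 0.9, "boring": 0.7}
substitute_rank = rank_by_substitute(substitute_scores, tokens)

print("leave-one-out queries:", target_a.queries)
print("substitute queries:", target_b.queries)
```

In this toy setup both rankings put "great" and "boring" first, but the leave-one-out ranking spends one target query per word plus one for the unperturbed input, while the substitute ranking spends none before the word-replacement stage begins.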
