A Differentiable Language Model Adversarial Attack on Text Classifiers
Ivan Fursov, Alexey Zaytsev, Pavel Burnyshev, Ekaterina Dmitrieva, Nikita Klyuchnikov, Andrey Kravchenko, Ekaterina Artemova, Evgeny Burnaev

TL;DR
This paper introduces a novel sentence-level adversarial attack on NLP models that fine-tunes a language model to generate hard-to-detect adversarial examples, revealing vulnerabilities in current NLP classifiers.
Contribution
It presents a differentiable, sentence-level attack method using a fine-tuned language model and a new loss function, outperforming existing attacks and challenging model robustness.
Findings
The attack outperforms competitors on various NLP tasks.
Generated adversarial examples are difficult to detect.
Current models are vulnerable to this new attack.
Abstract
The robustness of huge Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve the robustness of these models is to explore an adversarial attack scenario: check whether a small perturbation of an input can fool a model. Due to the discrete nature of textual data, gradient-based adversarial methods, widely used in computer vision, are not applicable per se. The standard strategy to overcome this issue is to develop token-level transformations, which do not take the whole sentence into account. In this paper, we propose a new black-box sentence-level attack. Our method fine-tunes a pre-trained language model to generate adversarial examples. The proposed differentiable loss function depends on a substitute classifier score and an approximate edit distance computed via a deep learning model.…
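The abstract describes a loss with two ingredients: a substitute classifier score and an approximate edit distance. The sketch below is only an illustration of how such terms might be combined, not the authors' implementation. The function names, the weight `beta`, the `target_distance` parameter, and the exact functional form are all assumptions made for this example. It works on plain Python scalars; in an actual attack both inputs would come from differentiable models so that gradients can flow back to the language model. The exact word-level Levenshtein distance is included to show the discrete quantity that a learned model would have to approximate.

```python
import math


def word_edit_distance(a: str, b: str) -> int:
    """Exact word-level Levenshtein distance.

    This is the discrete, non-differentiable quantity; the abstract
    says the attack uses a learned approximation of it instead.
    """
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def attack_loss(p_true: float,
                approx_distance: float,
                beta: float = 1.0,
                target_distance: float = 1.0,
                eps: float = 1e-12) -> float:
    """Hypothetical two-term attack loss (assumed form, for illustration).

    p_true          -- substitute classifier's probability of the true label
                       for the generated text (lower means a stronger attack)
    approx_distance -- approximate edit distance between original and
                       generated text (in practice a model output)
    beta            -- assumed weight balancing the two terms
    target_distance -- assumed desired distance, so that the generated text
                       stays close to the original without being identical
    """
    # Small when the substitute classifier is fooled.
    fool_term = -math.log(max(1.0 - p_true, eps))
    # Small when the generated text is about target_distance edits away.
    distance_term = beta * (target_distance - approx_distance) ** 2
    return fool_term + distance_term


if __name__ == "__main__":
    original = "the movie was great"
    candidate = "the film was great"
    d = word_edit_distance(original, candidate)
    print(d, attack_loss(p_true=0.2, approx_distance=float(d)))
```

Under these assumptions the loss falls when the classifier's confidence in the true label drops and rises when the generated text drifts far from the original, which matches the trade-off the abstract describes.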
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
