Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend

Ning Lu; Shengcai Liu; Zhirui Zhang; Qi Wang; Haifeng Liu; Ke Tang

arXiv:2302.02568 · cs.CL · April 16, 2024 · 1 citation

TL;DR

This paper investigates word-level textual adversarial attacks, shows that in roughly 90% of cases they produce examples with lower n-gram frequency, and uses this insight to improve NLP model robustness through frequency-based adversarial training.

Contribution

It introduces the n-gram Frequency Descend (n-FD) pattern as a key characteristic of word-level attacks and demonstrates a new robustness enhancement method using n-gram frequency information.

Findings

01. Approximately 90% of attacks cause a decrease in n-gram frequency.

02. Frequency-based adversarial training performs comparably to gradient-based methods.

03. The paper proposes a novel, intuitive perspective for understanding and defending against textual adversarial attacks.

Abstract

Word-level textual adversarial attacks have demonstrated notable efficacy in misleading Natural Language Processing (NLP) models. Despite their success, the underlying reasons for their effectiveness and the fundamental characteristics of adversarial examples (AEs) remain obscure. This work aims to interpret word-level attacks by examining their $n$-gram frequency patterns. Our comprehensive experiments reveal that in approximately 90% of cases, word-level attacks lead to the generation of examples where the frequency of $n$-grams decreases, a tendency we term as the $n$-gram Frequency Descend ($n$-FD). This finding suggests a straightforward strategy to enhance model robustness: training models using examples with $n$-FD. To examine the feasibility of this strategy, we employed the $n$-gram frequency information, as an alternative to conventional loss gradients, to generate perturbed…
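
To make the $n$-FD idea concrete, here is a minimal Python sketch. It assumes a sentence's $n$-gram frequency is the sum of the training-corpus counts of its $n$-grams, which is one plausible reading of the abstract and may differ from the paper's exact definition. The helper names (`build_counts`, `ngram_frequency`, `is_frequency_descend`) and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(corpus, n):
    """Count every n-gram in a corpus of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def ngram_frequency(sentence, counts, n):
    """Assumed definition: sum of corpus counts over the sentence's n-grams.

    Unseen n-grams contribute 0 because Counter returns 0 for missing keys.
    """
    return sum(counts[g] for g in ngrams(sentence.lower().split(), n))


def is_frequency_descend(original, perturbed, counts, n):
    """True when the perturbed sentence has a strictly lower n-gram frequency."""
    return ngram_frequency(perturbed, counts, n) < ngram_frequency(original, counts, n)


# Toy training corpus (illustrative only).
corpus = [
    "the movie was good",
    "the movie was bad",
    "the film was good",
]
counts = build_counts(corpus, 2)

original = "the movie was good"
perturbed = "the movie was decent"  # a word-level substitution

print(ngram_frequency(original, counts, 2))
print(ngram_frequency(perturbed, counts, 2))
print(is_frequency_descend(original, perturbed, counts, 2))
```

In this toy case, swapping "good" for the unseen word "decent" replaces a bigram that occurs in the corpus with one that does not, so the perturbed sentence's total frequency drops. A frequency-based perturbation strategy of the kind the abstract describes could, under the same assumption, search for substitutions that lower this score instead of following loss gradients.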

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning

Methods: Autoencoders