Fooling the Textual Fooler via Randomizing Latent Representations
Duy C. Hoang, Quang H. Nguyen, Saurav Manchanda, MinLong Peng, Kok-Seng Wong, Khoa D. Doan

TL;DR
This paper introduces AdvFooler, a lightweight defense mechanism that randomizes latent representations at inference time to thwart query-based black-box adversarial attacks on NLP models while maintaining high accuracy on clean data.
Contribution
AdvFooler is a novel, attack-agnostic defense that requires no additional training and confuses adversaries by randomizing the model's latent representations at inference time.
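The idea of randomizing a latent representation at inference time can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `TinyClassifier` is a hypothetical two-layer stand-in for a pretrained NLP model, the noise is Gaussian, and `noise_scale` is a made-up hyper-parameter name. This is not the authors' implementation.

```python
import numpy as np


class TinyClassifier:
    """Hypothetical two-layer classifier standing in for a pretrained model."""

    def __init__(self, dim_in=8, dim_hidden=16, n_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(dim_in, dim_hidden))
        self.w2 = rng.normal(size=(dim_hidden, n_classes))

    def hidden(self, x):
        # Latent representation of the input.
        return np.tanh(x @ self.w1)

    def logits_from_hidden(self, h):
        # Classification head applied to a (possibly perturbed) latent.
        return h @ self.w2


def predict_randomized(model, x, noise_scale, rng):
    """Add Gaussian noise (an assumption) to the latent before classifying.

    No weights are changed and no training is involved; the randomness
    is applied only on the forward pass.
    """
    h = model.hidden(x)
    h_noisy = h + rng.normal(scale=noise_scale, size=h.shape)
    return model.logits_from_hidden(h_noisy)


if __name__ == "__main__":
    model = TinyClassifier()
    x = np.ones((1, 8))
    rng = np.random.default_rng(42)
    # Two queries with the same input return slightly different logits.
    print(predict_randomized(model, x, noise_scale=0.1, rng=rng))
    print(predict_randomized(model, x, noise_scale=0.1, rng=rng))
```

With `noise_scale=0.0` the function reduces to the ordinary deterministic forward pass, which is why such a defense can be bolted onto an already trained model. How large the noise can be before clean accuracy suffers is a question for the full paper, not something this sketch answers.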
Findings
Achieves near state-of-the-art robustness against word-level attacks
Maintains high accuracy on clean data
Requires no extra training overhead
Abstract
Despite outstanding performance in a variety of NLP tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Among these attacks, adversarial word-level perturbations are well-studied and effective attack strategies. Since these attacks work in black-box settings, they do not require access to the model architecture or model parameters and thus can be detrimental to existing NLP applications. To perform an attack, the adversary queries the victim model many times to determine the most important words in an input text and to replace these words with their corresponding synonyms. In this work, we propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example in these query-based black-box attacks; that is to fool the textual…
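To make the attack loop described in the abstract concrete, here is a toy sketch of query-based word-importance ranking and of how noise in the victim's latent representation can disturb it. All names are hypothetical: `WEIGHTS` is an invented bag-of-words "model", importance is estimated by deleting one word per query (one common scoring scheme, assumed here), and Gaussian noise on per-word latents is a simplification rather than the paper's exact mechanism.

```python
import numpy as np

# Hypothetical per-word contributions to a sentiment score.
WEIGHTS = {"great": 2.0, "movie": 0.1, "really": 0.3, "boring": -2.0, "the": 0.0}


def victim_score(words, noise_scale=0.0, rng=None):
    """Toy victim model: sums per-word latents, optionally with Gaussian noise."""
    latent = np.array([WEIGHTS.get(w, 0.0) for w in words])
    if noise_scale > 0:
        if rng is None:
            rng = np.random.default_rng()
        latent = latent + rng.normal(scale=noise_scale, size=latent.shape)
    return float(latent.sum())


def importance_ranking(words, noise_scale=0.0, rng=None):
    """Attacker's view: rank words by the score drop observed when each is deleted.

    Each deletion costs one query to the victim model.
    """
    base = victim_score(words, noise_scale, rng)
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drops.append(base - victim_score(reduced, noise_scale, rng))
    order = np.argsort(drops)[::-1]  # largest drop first
    return [words[i] for i in order]


if __name__ == "__main__":
    sentence = ["the", "movie", "really", "great"]
    print("no noise:  ", importance_ranking(sentence))
    rng = np.random.default_rng(0)
    for _ in range(3):
        print("with noise:", importance_ranking(sentence, noise_scale=1.0, rng=rng))
```

Without noise the attacker recovers a stable ranking and knows which word to replace first. With noise, repeated queries on the same sentence can yield different rankings, so the attacker may spend queries perturbing words that matter little. The noise scale of 1.0 is deliberately exaggerated for this toy model.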
Peer Reviews
Decision: Submitted to ICLR 2024
Strengths: 1. Low-cost defense.
Weaknesses: 1. Does not perform better than prior works. 2. Similar in ideology to DP.
Strengths: 1. Compared with other defense methods, the proposed method AdvFooler is simple, pluggable and does not require additional computational overhead during testing or access to training. 2. The authors conducted comprehensive experiments to assess the effectiveness of AdvFooler, employing two BERT models, two distinct datasets, and three different attack methods. Furthermore, they provided qualitative analyses of their results.
Weaknesses: 1. Despite its advantages in terms of simple implementation and minimal computational overhead, AdvFooler's performance falls short of the state-of-the-art. In Table 2, on the AGNEWS dataset, AdvFooler exhibits lower accuracy under attack compared to RanMASK for the BERT-base model, and it also demonstrates lower accuracy under attack than both TMD and RanMASK for the RoBERTa-base model. 2. The selection of the hyper-parameter for noise scale in AdvFooler is not entirely clear. The authors clai
Strengths: 1. The paper is well-written and easy to follow. 2. AdvFooler is simple and seems to be effective against several attacks.
Weaknesses: 1. The motivation is not clear. Why does such randomization fool the attackers while not degrading the benign performance? 2. Why can AdvFooler only perplex query-based black-box attacks? It is important for a defense method to defend against various attacks, such as white-box attacks [1], decision-based attacks [2,3], and so on. It is necessary to validate the effectiveness against these attacks to show the generality of AdvFooler. 3. AdvFooler does not outperform the SOTA baselines ag
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
