Fooling the Textual Fooler via Randomizing Latent Representations

Duy C. Hoang; Quang H. Nguyen; Saurav Manchanda; MinLong Peng; Kok-Seng Wong; Khoa D. Doan

arXiv:2310.01452 · cs.CL · June 11, 2024

TL;DR

This paper introduces AdvFooler, a lightweight defense that randomizes latent representations at inference time to thwart query-based black-box adversarial attacks on NLP models while maintaining high accuracy on clean data.

Contribution

AdvFooler is a novel, attack-agnostic defense that requires no additional training and confuses adversaries by randomizing the model's latent representations during inference.
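To make the idea concrete, here is a minimal sketch, not the authors' implementation. It uses a toy two-layer classifier whose hidden vector receives fresh zero-mean Gaussian noise on every forward pass. The choice of Gaussian noise, the `noise_scale` parameter, the toy weights, and all function names are assumptions made for illustration only.

```python
import math
import random

# Hypothetical toy weights; a real victim model would be a fine-tuned transformer.
W_HIDDEN = [[0.8, -0.5], [0.3, 0.9], [-0.6, 0.4]]  # 3 hidden units, 2 inputs
W_OUT = [[1.0, -0.7, 0.5], [-1.0, 0.7, -0.5]]      # 2 classes, 3 hidden units


def hidden(x):
    """Latent representation: one tanh layer over the input features."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]


def logits(h):
    """Linear classification head over the latent vector."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in W_OUT]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict_proba(x, noise_scale=0.0, rng=random):
    """Forward pass; noise_scale > 0 perturbs the latent vector (assumed Gaussian)."""
    h = hidden(x)
    if noise_scale > 0.0:
        # Fresh noise is drawn on every call, so repeated queries disagree slightly.
        h = [hi + rng.gauss(0.0, noise_scale) for hi in h]
    return softmax(logits(h))


x = [1.0, 0.5]
rng = random.Random(0)
print(predict_proba(x))                             # deterministic baseline
print(predict_proba(x, noise_scale=0.1, rng=rng))   # randomized query 1
print(predict_proba(x, noise_scale=0.1, rng=rng))   # randomized query 2
```

In this sketch nothing is retrained: the weights are untouched and the only change is at inference time. How large the noise can be before clean accuracy suffers is the hyper-parameter question raised in the reviews.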

Findings

1. Achieves near state-of-the-art robustness against word-level attacks
2. Maintains high accuracy on clean data
3. Requires no extra training overhead

Abstract

Despite outstanding performance in a variety of NLP tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Among these attacks, adversarial word-level perturbations are well-studied and effective attack strategies. Since these attacks work in black-box settings, they do not require access to the model architecture or model parameters and thus can be detrimental to existing NLP applications. To perform an attack, the adversary queries the victim model many times to determine the most important words in an input text and to replace these words with their corresponding synonyms. In this work, we propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example in these query-based black-box attacks; that is to fool the textual…
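The query loop described in the abstract can be sketched as follows. This is an illustrative toy, not any specific attack from the paper: the leave-one-out scoring rule, the lexicon-based `victim_confidence` model, and the scalar Gaussian noise (a simplified stand-in for randomizing a latent representation) are all assumptions.

```python
import math
import random

# Hypothetical sentiment lexicon standing in for a real victim model.
LEXICON = {"great": 2.0, "good": 1.0, "boring": -1.5, "bad": -2.0}


def victim_confidence(words, noise_scale=0.0, rng=random):
    """Return P(positive); optional Gaussian noise on the internal score."""
    score = sum(LEXICON.get(w, 0.0) for w in words)
    if noise_scale > 0.0:
        score += rng.gauss(0.0, noise_scale)
    return 1.0 / (1.0 + math.exp(-score))


def rank_words_by_importance(words, query):
    """Leave-one-out ranking: confidence drop when each word is deleted."""
    base = query(words)
    drops = []
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        drops.append((base - query(reduced), w))
    # Largest drop first; these are the words an attacker would try to replace.
    return [w for _, w in sorted(drops, reverse=True)]


text = "the plot is great and the acting is good".split()

# Without noise the attacker recovers the truly influential words.
clean = rank_words_by_importance(text, victim_confidence)
print(clean[:2])  # ['great', 'good']

# With noise, each query sees a different perturbation, so the ranking is unreliable.
rng = random.Random(0)
noisy = rank_words_by_importance(
    text, lambda ws: victim_confidence(ws, noise_scale=1.0, rng=rng)
)
print(noisy[:2])
```

The attacker needs one query per word just to build the ranking, which is why perturbing every response independently can mislead the search without the defender knowing which attack is in use.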

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5: marginally below the acceptance threshold · Confidence 5

Strengths

1. Low-cost defense.

Weaknesses

1. Does not perform better than prior works.
2. Similar in ideology to DP.

Reviewer 02 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

1. Compared with other defense methods, the proposed method AdvFooler is simple, pluggable and does not require additional computational overhead during testing or access to training.
2. The authors conducted comprehensive experiments to assess the effectiveness of AdvFooler, employing two BERT models, two distinct datasets, and three different attack methods. Furthermore, they provided qualitative analyses of their results.

Weaknesses

1. Despite its advantages in terms of simple implementation and minimal computational overhead, AdvFooler's performance falls short of the state-of-the-art. In Table 2, on the AGNEWS dataset, AdvFooler exhibits lower accuracy under attack compared to RanMASK for the BERT-base model, and it also demonstrates lower accuracy under attack than both TMD and RanMASK for the RoBERTa-base model.
2. The selection of the hyper-parameter for noise scale in AdvFooler is not entirely clear. The authors clai…

Reviewer 03 · Rating 3: reject, not good enough · Confidence 5

Strengths

1. The paper is well-written and easy to follow.
2. AdvFooler is simple and seems to be effective against several attacks.

Weaknesses

1. The motivation is not clear. Why does such randomization fool the attackers while not degrading the benign performance?
2. Why can AdvFooler only perplex query-based black-box attacks? It is important for a defense method to defend against various attacks, such as white-box attacks [1], decision-based attacks [2,3], and so on. It is necessary to validate the effectiveness against these attacks to show the generality of AdvFooler.
3. AdvFooler does not outperform the SOTA baselines ag…

Code & Models

Videos

Fooling the Textual Fooler via Randomizing Latent Representations · underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques