Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model

Prashanth Vijayaraghavan; Deb Roy

arXiv:1909.07873·cs.LG·March 3, 2021

TL;DR

This paper introduces a reinforcement learning approach to generate black-box adversarial examples for text classifiers, effectively fooling models while preserving the original semantics.

Contribution

It presents a novel deep reinforcement learning method for creating semantics-preserving adversarial text examples in black-box settings.

Findings

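01 Fools well-trained models on IMDB sentiment classification and AG's news categorization

02 Achieves high success rates in generating adversarial examples in the black-box setting

03 Generated adversarial examples are semantics-preserving perturbations of the original text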

Abstract

Recently, generating adversarial examples has become an important means of measuring the robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In the natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement-learning-based approach to generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models on (a) the IMDB sentiment classification task and (b) the AG's news corpus categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.
