Leveraging Extracted Model Adversaries for Improved Black Box Attacks

Naveen Jafer Nizar; Ari Kobren

arXiv:2010.16336 · cs.LG · November 3, 2020

TL;DR

This paper introduces a two-step method for black box adversarial attacks on reading comprehension models, combining model extraction with white box perturbation techniques to enhance attack success.

Contribution

The paper proposes extracting an approximate copy of a black box question answering model, attacking that copy with a white box method, and then using the resulting adversarial inputs against the original victim model.

Findings

01

Improves on the AddAny white box attack, run against the approximate model, by 25% F1

02

Enhances AddSent black box attack by 11% F1

03

Evaluated on reading comprehension based question answering models

Abstract

We present a method for adversarial input generation against black box models for reading comprehension based question answering. Our approach is composed of two steps. First, we approximate a victim black box model via model extraction (Krishna et al., 2020). Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are then used against the victim. In experiments we find that our method improves on the efficacy of AddAny (a white box attack) performed on the approximate model by 25% F1, and of AddSent (a black box attack) by 11% F1 (Jia and Liang, 2017).
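
The two-step pipeline in the abstract can be illustrated with a toy Python sketch. This is not the authors' implementation: every name here (`overlap_model`, `first_sentence_model`, `extract_model`, `white_box_attack`, `transfer_attack`, the `fillers` list) is hypothetical. Extraction is replaced by picking, from a small set of candidate models, the one that best agrees with the victim's answers. The white box attack is replaced by a simple search for an appended distractor sentence, loosely in the spirit of AddAny.

```python
from typing import Callable, List, Tuple

# Assumption: a model maps (passage, question) to a predicted answer string.
Model = Callable[[str, str], str]


def sentences(passage: str) -> List[str]:
    """Split a passage into sentences on periods (toy tokenizer)."""
    return [s.strip() for s in passage.split(".") if s.strip()]


def overlap_model(passage: str, question: str) -> str:
    """Toy reader: return the sentence sharing the most words with the question."""
    q = set(question.lower().split())
    return max(sentences(passage), key=lambda s: len(q & set(s.lower().split())))


def first_sentence_model(passage: str, question: str) -> str:
    """Toy reader: always return the first sentence."""
    return sentences(passage)[0]


def extract_model(victim: Model, queries: List[Tuple[str, str]],
                  candidates: List[Model]) -> Model:
    """Step 1 (stand-in for model extraction): label the queries with the
    victim, then keep the candidate that agrees with those labels most often."""
    labels = [victim(p, q) for p, q in queries]

    def agreement(model: Model) -> int:
        return sum(model(p, q) == y for (p, q), y in zip(queries, labels))

    return max(candidates, key=agreement)


def white_box_attack(approx: Model, passage: str, question: str,
                     fillers: List[str]) -> str:
    """Step 2 (stand-in for the white box attack): append a distractor sentence
    that changes the approximate model's answer. Returns the passage unchanged
    if no filler works."""
    original = approx(passage, question)
    for filler in fillers:
        perturbed = f"{passage} {question} {filler}."
        if approx(perturbed, question) != original:
            return perturbed
    return passage


def transfer_attack(victim: Model, queries: List[Tuple[str, str]],
                    candidates: List[Model], examples: List[Tuple[str, str]],
                    fillers: List[str]) -> List[Tuple[str, bool]]:
    """Run both steps, then replay each perturbed passage against the victim.
    Each result is (perturbed passage, whether the victim's answer changed)."""
    approx = extract_model(victim, queries, candidates)
    results = []
    for passage, question in examples:
        adv = white_box_attack(approx, passage, question, fillers)
        results.append((adv, victim(passage, question) != victim(adv, question)))
    return results


if __name__ == "__main__":
    passage = "Shakespeare wrote Hamlet. Paris is in France."
    for adv, fooled in transfer_attack(
        overlap_model,                           # victim, used only via queries
        [(passage, "where is paris")],           # extraction queries
        [first_sentence_model, overlap_model],   # candidate approximations
        [(passage, "who wrote hamlet")],         # examples to attack
        ["zebra"],                               # filler words for distractors
    ):
        print(fooled, "|", adv)
```

The victim is only ever called as a function, which mirrors the black box setting: all perturbation search runs against the approximate model, and the victim sees only the final perturbed inputs. A real system would train a neural reader on the victim's answers and measure the drop in F1 rather than a changed answer string.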
