Leveraging Extracted Model Adversaries for Improved Black Box Attacks
Naveen Jafer Nizar, Ari Kobren

TL;DR
This paper introduces a two-step method for black box adversarial attacks on reading comprehension models, combining model extraction with white box perturbation techniques to enhance attack success.
Contribution
It proposes a novel approach that leverages extracted models to improve black box attack effectiveness against question answering systems.
Findings
Improves on the AddAny white box attack, run on the approximate model, by 25% F1
Improves on the AddSent black box attack by 11% F1
Demonstrates effectiveness on reading comprehension models
Abstract
We present a method for adversarial input generation against black box models for reading comprehension based question answering. Our approach is composed of two steps. First, we approximate a victim black box model via model extraction (Krishna et al., 2020). Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are then used against the victim. In experiments we find that our method improves on the efficacy of the AddAny attack (a white box attack) performed on the approximate model by 25% F1, and on the AddSent attack (a black box attack) by 11% F1 (Jia and Liang, 2017).
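The two steps in the abstract (extract an approximate model, then attack it with white box access and transfer the result to the victim) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "models" are hand-written sentence scorers rather than neural readers, `extract` simply picks the candidate scorer that agrees most with the victim's outputs, and `attack` is a greedy word-append search only loosely in the spirit of AddAny. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (question, sentence) -> score
Model = Callable[[str, str], str]     # (question, context) -> answer sentence


def tokens(text: str) -> List[str]:
    return text.lower().replace("?", " ").replace(".", " ").split()


def sentences(context: str) -> List[str]:
    return [s.strip() for s in context.split(".") if s.strip()]


def overlap_score(question: str, sentence: str) -> float:
    # Count sentence tokens that also occur in the question.
    q = set(tokens(question))
    return float(sum(1 for t in tokens(sentence) if t in q))


def length_score(question: str, sentence: str) -> float:
    # A deliberately poor alternative hypothesis: prefer long sentences.
    return float(len(tokens(sentence)))


def make_model(score: Scorer) -> Model:
    # A toy "reader" that answers with the highest-scoring sentence.
    def model(question: str, context: str) -> str:
        return max(sentences(context), key=lambda s: score(question, s))
    return model


# Toy victim. The attacker only calls it; its scorer is treated as hidden.
victim: Model = make_model(overlap_score)


def extract(black_box: Model, queries: List[Tuple[str, str]],
            candidates: List[Scorer]) -> Scorer:
    # Step 1 (toy extraction): keep the candidate whose answers agree
    # most often with the black box on the query set.
    def agreement(score: Scorer) -> int:
        model = make_model(score)
        return sum(model(q, c) == black_box(q, c) for q, c in queries)
    return max(candidates, key=agreement)


def attack(score: Scorer, question: str, context: str,
           vocab: List[str], length: int = 5) -> str:
    # Step 2 (toy white box attack): greedily build a distractor sentence,
    # one word at a time, that maximises the surrogate's own score.
    distractor: List[str] = []
    for _ in range(length):
        best = max(vocab,
                   key=lambda w: score(question, " ".join(distractor + [w])))
        distractor.append(best)
    return context + " " + " ".join(distractor) + "."


queries = [("Where was Tesla born?",
            "Tesla was born in Smiljan. He later moved to New York "
            "and worked there for many years.")]
surrogate_score = extract(victim, queries, [length_score, overlap_score])

question = "Where was Tesla born?"
context = "Tesla was born in Smiljan. He later moved to New York."
adversarial = attack(surrogate_score, question, context,
                     vocab=tokens(question) + ["paris"])

print(victim(question, context))      # answer on the clean context
print(victim(question, adversarial))  # answer after transferring the attack
```

The design point the sketch tries to capture is that the attack never inspects the victim's internals: the search in `attack` uses only the surrogate's scores, and the victim is queried solely to label extraction data and to check whether the perturbed context transfers.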
