Scoring Black-Box Models for Adversarial Robustness

Jian Vora; Pranay Reddy Samala

arXiv:2210.17140 · cs.LG · November 1, 2022

Open Access

TL;DR

This paper introduces a simple score for the adversarial robustness of black-box models. The score is computed from properties of LIME explanations and serves as a cheaper alternative to evaluating robustness by running white-box attacks.

Contribution

The paper proposes a black-box scoring technique that assesses adversarial robustness from two properties of LIME explanations, the $l_1$-norm of the LIME weights and the sharpness of the explanations, without running white-box attacks.

Findings

01. Adversarially more robust models have a smaller $l_1$-norm of LIME weights.

02. Adversarially more robust models produce sharper explanations.

03. The proposed score correlates with adversarial robustness.

Abstract

Deep neural networks are susceptible to adversarial inputs, and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of models to adversarial attacks has been analyzed by first constructing adversarial inputs for the model and then testing the model's performance on the constructed inputs. Most of these attacks require white-box access to the model and access to data labels, and finding adversarial inputs can be computationally expensive. We propose a simple scoring method for black-box models that indicates their robustness to adversarial inputs. We show that adversarially more robust models have a smaller $l_1$-norm of LIME weights and sharper explanations.
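The idea of scoring a model by the $l_1$-norm of its LIME weights can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the `lime` package: it fits a LIME-style local linear surrogate with plain NumPy, assuming Gaussian perturbations around the input, an exponential proximity kernel, and a small ridge penalty. The function name `lime_l1_score` and all of its parameters are hypothetical, and the black-box model is assumed to return one scalar per input (for example, the probability of a single class).

```python
import numpy as np


def lime_l1_score(predict_fn, x, n_samples=500, sigma=0.1,
                  kernel_width=0.75, ridge=1e-3, seed=0):
    """l1-norm of the weights of a LIME-style local linear surrogate.

    Illustrative assumptions (not taken from the paper): Gaussian
    perturbations, an exponential proximity kernel, ridge regression.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.shape[0]

    # Sample points near x and query the black box (no gradients, no labels).
    Z = x + sigma * rng.standard_normal((n_samples, d))
    y = np.asarray(predict_fn(Z), dtype=float)

    # Weight each sample by its proximity to x.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))

    # Weighted ridge regression with an unpenalized intercept column.
    A = np.hstack([Z - x, np.ones((n_samples, 1))])
    Aw = A * w[:, None]
    reg = ridge * np.eye(d + 1)
    reg[-1, -1] = 0.0
    coef = np.linalg.solve(A.T @ Aw + reg, Aw.T @ y)

    # Drop the intercept and sum the absolute feature weights.
    return float(np.abs(coef[:-1]).sum())


# Two toy "black boxes": one changes slowly with its input, one changes quickly.
x0 = np.zeros(5)
score_flat = lime_l1_score(lambda Z: Z @ np.full(5, 0.1), x0)
score_steep = lime_l1_score(lambda Z: Z @ np.full(5, 5.0), x0)
print(score_flat, score_steep)
```

In this toy setup the model whose output changes slowly around `x0` receives the smaller score, which matches the direction of the paper's first finding. In practice the score would presumably be averaged over many inputs; that aggregation is also an assumption here.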

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications

Methods: Local Interpretable Model-Agnostic Explanations (LIME)