Red Teaming Deep Neural Networks with Feature Synthesis Tools
Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell

TL;DR
This paper introduces a benchmark that evaluates interpretability tools by implanting known trojans into neural networks and testing whether the tools can identify them. It reveals limitations of current tools and proposes new methods.
Contribution
It proposes trojan discovery as a new evaluation benchmark, assesses existing interpretability tools, and introduces improved feature-synthesis methods for bug detection.
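To make the idea of an implanted, known trojan concrete, here is a minimal sketch of one common way to create one: stamping a small patch onto a fraction of training images and relabeling them to a target class. This summary does not say how the paper implants its trojans, so the data-poisoning approach, the function name `implant_patch_trigger`, and all parameters below are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np


def implant_patch_trigger(images, labels, patch, target_label,
                          poison_frac=0.1, seed=0):
    """Hypothetical helper: poison a dataset with a patch trigger.

    images: array of shape (N, H, W, C)
    labels: integer array of shape (N,)
    patch:  array of shape (ph, pw, C), the trigger to stamp in
    Returns poisoned copies of images and labels, plus the poisoned indices.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()  # leave the caller's arrays untouched
    labels = labels.copy()

    n = len(images)
    n_poison = int(round(poison_frac * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    ph, pw = patch.shape[:2]
    h, w = images.shape[1:3]
    for i in idx:
        # Place the patch at a random location that fits inside the image.
        top = rng.integers(0, h - ph + 1)
        left = rng.integers(0, w - pw + 1)
        images[i, top:top + ph, left:left + pw] = patch
        # Relabel so that a model trained on this data associates
        # the patch with the target class.
        labels[i] = target_label
    return images, labels, idx
```

A model trained on the poisoned data would then carry a trigger whose ground truth is known, so an interpretability tool can be scored on whether its output helps a user recover that trigger.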
Findings
State-of-the-art tools often fail to detect implanted trojans.
The benchmark reveals significant challenges for current interpretability methods.
New feature-synthesis variants improve trojan detection performance.
Abstract
Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by using a particular dataset. This only allows for the study of the model in the context of features that the user can sample in advance. To address this, a growing body of research involves interpreting models using feature synthesis methods that do not depend on a dataset. In this paper, we benchmark the usefulness of interpretability tools on debugging tasks. Our key insight is that we can implant human-interpretable trojans into models and then evaluate these tools based on whether…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
