Patch Shortcuts: Interpretable Proxy Models Efficiently Find Black-Box Vulnerabilities
Julia Rosenzweig, Joachim Sicking, Sebastian Houben, Michael Mock, Maram Akila

TL;DR
This paper introduces a method using an interpretable proxy model, specifically a BagNet, to efficiently identify learned shortcuts in black-box neural networks, enhancing safety by uncovering vulnerabilities related to spurious correlations.
Contribution
The paper presents a novel approach employing an interpretable proxy model to detect shortcut-based vulnerabilities in black-box neural networks, which was previously challenging due to limited access.
Findings
Patch shortcuts significantly influence the black-box model.
The proxy-based method effectively uncovers local image patch vulnerabilities.
Identified shortcuts can be transferred and validated in black-box models.
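The findings above describe a two-step loop: use the proxy to locate a high-evidence image patch, then paste that patch into other images and check whether the black-box model's predictions change. The following is only a toy sketch of that idea under stated assumptions, not the authors' implementation. The names `patch_scorer`, `black_box`, `top_patch`, and `flip_rate` are hypothetical, the proxy is reduced to a function that scores each patch independently (a real BagNet is a CNN with a small receptive field), and the black box is a stand-in that returns only a class label.

```python
import numpy as np

def patch_evidence(image, patch_scorer, patch=8, stride=8):
    """Score each patch independently, mimicking a bag-of-local-features proxy."""
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    return np.array([[patch_scorer(image[r:r + patch, c:c + patch])
                      for c in cols] for r in rows])

def top_patch(image, patch_scorer, patch=8, stride=8):
    """Return the patch with the highest proxy evidence and its top-left corner."""
    heat = patch_evidence(image, patch_scorer, patch, stride)
    i, j = np.unravel_index(np.argmax(heat), heat.shape)
    r, c = int(i) * stride, int(j) * stride
    return image[r:r + patch, c:c + patch].copy(), (r, c)

def flip_rate(black_box, images, candidate, target_class, pos=(0, 0)):
    """Fraction of non-target images the black box assigns to target_class
    once the candidate patch is pasted in. Only label queries are used."""
    flips = total = 0
    r, c = pos
    ph, pw = candidate.shape[:2]
    for img in images:
        if black_box(img) == target_class:
            continue  # already predicted as the target class; not informative
        patched = img.copy()
        patched[r:r + ph, c:c + pw] = candidate
        total += 1
        flips += int(black_box(patched) == target_class)
    return flips / total if total else 0.0

# Toy setup (all assumptions): class 1 is spuriously tied to a bright square.
def patch_scorer(p):      # hypothetical proxy evidence for class 1
    return float(p.mean())

def black_box(img):       # hypothetical label-only black-box model
    return int(img.max() > 0.9)

source = np.zeros((16, 16))
source[8:16, 8:16] = 1.0  # image containing the shortcut patch

rng = np.random.default_rng(0)
clean_images = [rng.uniform(0.0, 0.5, size=(16, 16)) for _ in range(5)]

candidate, location = top_patch(source, patch_scorer)
rate = flip_rate(black_box, clean_images, candidate, target_class=1)
print(location, rate)
```

A high flip rate on images that do not otherwise belong to the target class suggests that the black box reacts to the patch itself, which is the kind of transfer check the findings refer to. In practice the paste position, patch size, and number of queries would all need to be chosen for the model at hand.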
Abstract
An important pillar for safe machine learning (ML) is the systematic mitigation of weaknesses in neural networks to afford their deployment in critical applications. A ubiquitous class of safety risks are learned shortcuts, i.e., spurious correlations a network exploits for its decisions that have no semantic connection to the actual task. Networks relying on such shortcuts bear the risk of not generalizing well to unseen inputs. Explainability methods help to uncover such network vulnerabilities. However, many of these techniques are not directly applicable if access to the network is constrained, in so-called black-box setups. These setups are prevalent when using third-party ML components. To address this constraint, we present an approach to detect learned shortcuts using an interpretable-by-design network as a proxy to the black-box model of interest. Leveraging the proxy's…
