Interpreting Robustness Proofs of Deep Neural Networks
Debangshu Banerjee, Avaljot Singh, Gagandeep Singh

TL;DR
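This paper develops methods to generate human-understandable interpretations of robustness proofs for deep neural networks. The interpretations reveal that proofs for standard DNNs rely on spurious input features, and that combining adversarial and provably robust training yields proofs that rely on more meaningful features.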
Contribution
It introduces new concepts and algorithms for interpretable robustness proofs, bridging the gap between formal verification and human understanding.
Findings
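Robustness proofs of standard DNNs rely on spurious input features.
Proofs of DNNs trained to be provably robust filter out even semantically meaningful features.
Proofs of DNNs trained with combined adversarial and provably robust training are the most effective at filtering out spurious features while relying on human-understandable ones.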
Abstract
In recent years, numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about a DNN's behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human-understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even the semantically meaningful features. The proofs for the DNNs combining adversarial and provably robust training are the most effective at selectively filtering out spurious features as well as relying on human-understandable input features.
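To make the idea of a proof relying on only some features concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a verifier has already produced interval bounds for the penultimate-layer neurons, that the robustness property reduces to a single linear margin being positive, and that dropping a feature means fixing it to zero. All names (`margin_lower_bound`, `is_sufficient`, `greedy_prune`) and the pruning heuristic are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (lower, upper) bound of one penultimate-layer neuron, as a verifier might report it.
Interval = Tuple[float, float]


def margin_lower_bound(weights: List[float], bias: float,
                       bounds: List[Interval], keep: Set[int]) -> float:
    """Lower bound of bias + sum_i weights[i] * x_i, where x_i ranges over
    bounds[i] for kept features and is fixed to 0 for dropped ones
    (the zeroing convention is an assumption of this sketch)."""
    total = bias
    for i, (w, (lo, hi)) in enumerate(zip(weights, bounds)):
        if i in keep:
            # A non-negative weight is minimised at the lower bound,
            # a negative weight at the upper bound.
            total += w * lo if w >= 0 else w * hi
    return total


def is_sufficient(weights: List[float], bias: float,
                  bounds: List[Interval], keep: Set[int]) -> bool:
    """A feature subset is treated as sufficient if the margin is provably positive."""
    return margin_lower_bound(weights, bias, bounds, keep) > 0.0


def greedy_prune(weights: List[float], bias: float,
                 bounds: List[Interval]) -> Set[int]:
    """Drop features one at a time, least influential first (hypothetical
    priority: |weight| times the largest absolute bound), keeping a drop
    only if the remaining subset is still sufficient."""
    keep = set(range(len(weights)))

    def priority(i: int) -> float:
        lo, hi = bounds[i]
        return abs(weights[i]) * max(abs(lo), abs(hi))

    for i in sorted(keep, key=priority):
        trial = keep - {i}
        if is_sufficient(weights, bias, bounds, trial):
            keep = trial
    return keep


if __name__ == "__main__":
    # Toy margin row (true-class logit minus another logit) and toy bounds.
    weights = [2.0, -1.0, 0.1, 0.5]
    bounds = [(1.0, 2.0), (0.0, 0.5), (0.0, 1.0), (-0.2, 0.3)]
    print(greedy_prune(weights, 0.0, bounds))
```

In this toy setting a single feature is enough to keep the margin provably positive, so the pruned set is much smaller than the full feature set. Each pruning step costs one sufficiency check, which is the kind of query cost the reviewers ask about.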
Peer Reviews
Decision: ICLR 2024 poster
The paper is well-structured and technically solid. To the best of my knowledge, most existing works focus on debugging neural networks and explaining how they make predictions. In contrast, this paper proposes to identify the most influential proof features for a DNN verifier, which is different and new. I also like how the paper presents the problem formulation for proof dissection, where the three expectations are discussed in great detail. The proposed ProFIt algorithm and how it approximate
While I mostly enjoyed reading the paper, the significance of generating human-understandable proof features and the use case of the extracted proof features need to be explained more clearly. A DNN verifier is supposed to provide a mathematically sound robustness certificate, so by design, it gives a result that can be trusted. However, this paper aims to identify important/human-understandable proof features that can explain how the DNN verifier works, so I wonder why we need a DNN verifier to
The paper proposes a theoretically-inspired heuristic algorithm that seems to work as expected. The algorithm provides improved interpretability to the robust verification problem. Furthermore, the qualitative comparisons are constructive. The paper is generally well-structured, the notations are mostly well-defined, and the experiment results are clearly presented. The proposed algorithm is general and is compatible with existing certification methods.
- I find the motivation of this work to be a little ambiguous. Specifically, the author argues that the challenge this work aims to address here is "investigating the entire set $\mathcal{F}$ is always a valid but expensive option considering the size of $\mathcal{F}$". However, Algorithm 1 needs to iteratively query the verifier on subsets of $\mathcal{F}$, with the subset in the first couple of iterations potentially quite large. Particularly, at the first iteration, the cardinality of $F_{S_0
- The paper is well written.
- The problem of analysing and explaining the decisions of the various methods developed to verify neural networks is of interest to the ICLR community and, to the best of my knowledge, novel.
- The algorithm is sound, and the experimental results seem to show improvement compared to the state of the art.
- The experimental setting is not totally clear to me: 1) How is the original feature count in Table 1 computed? Should this not just be the size of the penultimate layer? Also, why is it 10 times larger for PGD-trained networks on MNIST than for PGD-trained networks on CIFAR? 2) As ProFIt has a subroutine that needs to check the sufficiency of a feature set, it would be good to perform experiments to analyse the scalability of the proposed method on various architectures for the various
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications
