Interpretable Disentanglement of Neural Networks by Extracting Class-Specific Subnetwork
Yulong Wang, Xiaolin Hu, Hang Su

TL;DR
This paper extracts a compressed, class-specific subnetwork for each semantic class of a trained neural network. Substituting these subnetworks for the full model improves gradient-based visual explanations and confidence score-based adversarial example detection, while keeping prediction performance comparable.
Contribution
The paper presents an approach for disentangling a neural network into class-specific subnetworks that are interpretable, structurally compressed, and comparable to the full model in prediction performance.
Findings
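The structure of the extracted subnetworks reflects the semantic similarity between their corresponding classes.
Replacing the full model with class-specific subnetworks improves explanation saliency accuracy for gradient-based explanation methods.
The same replacement increases the detection rate of confidence score-based adversarial example detection methods.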
Abstract
We propose a novel perspective to understand deep neural networks in an interpretable disentanglement form. For each semantic class, we extract a class-specific functional subnetwork from the original full model, with compressed structure while maintaining comparable prediction performance. The structure representations of extracted subnetworks display a resemblance to their corresponding class semantic similarities. We also apply extracted subnetworks in visual explanation and adversarial example detection tasks by merely replacing the original full model with class-specific subnetworks. Experiments demonstrate that this intuitive operation can effectively improve explanation saliency accuracy for gradient-based explanation methods, and increase the detection rate for confidence score-based adversarial example detection methods.
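The abstract does not describe the extraction procedure, so the following is only a minimal sketch of the two ideas it names: comparing the structure of subnetworks across classes, and scoring an input with a class-specific subnetwork in place of the full model. Everything here is an assumption made for illustration: the toy two-layer network with random weights, the hand-written binary masks over hidden units (standing in for extracted subnetworks), and the helper names `forward`, `subnetwork_confidence`, and `mask_similarity`.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HIDDEN, N_CLASSES = 8, 16, 3

# Toy "full model": a two-layer MLP with random weights.
# Assumption: stands in for a trained network; not the paper's architecture.
W1 = rng.normal(size=(N_IN, N_HIDDEN))
W2 = rng.normal(size=(N_HIDDEN, N_CLASSES))

# Hypothetical class-specific subnetworks, written by hand as binary masks
# over hidden units. In the paper these would come from the extraction step.
masks = np.zeros((N_CLASSES, N_HIDDEN))
masks[0, 0:10] = 1.0   # class 0 keeps units 0..9
masks[1, 4:14] = 1.0   # class 1 keeps units 4..13
masks[2, 10:16] = 1.0  # class 2 keeps units 10..15


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, mask=None):
    """Class probabilities from the full model, or from a masked subnetwork."""
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    if mask is not None:
        h = h * mask             # zero out units outside the subnetwork
    return softmax(h @ W2)


def subnetwork_confidence(x):
    """Predict with the full model, then re-score with that class's subnetwork."""
    pred = int(np.argmax(forward(x)))
    conf = float(forward(x, masks[pred])[pred])
    return pred, conf


def mask_similarity(m):
    """Pairwise cosine similarity between subnetwork masks."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T


x = rng.normal(size=N_IN)
pred, conf = subnetwork_confidence(x)
sim = mask_similarity(masks)
```

In this sketch, `sim` plays the role of a structure-based class similarity matrix, and `conf` is the kind of score a confidence-based detector could threshold. Whether the paper's subnetworks are defined at the level of units, channels, or weights is not stated in the abstract, and the mask granularity used here is a guess.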
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications
