Less is More: Fewer Interpretable Region via Submodular Subset Selection
Ruoyu Chen, Hua Zhang, Siyuan Liang, Jingzhi Li, Xiaochun Cao

TL;DR
This paper recasts image attribution as a submodular subset selection problem, identifying fewer but more accurate regions and addressing two weaknesses of existing methods: inaccurate small regions and poor attribution for misclassified samples.
Contribution
It models image attribution as a submodular subset selection problem with a new function and constraints, enhancing interpretability with fewer regions and better accuracy.
Findings
Outperforms state-of-the-art methods on multiple datasets.
Improves Deletion and Insertion scores significantly for correct predictions.
Achieves substantial gains in confidence and Insertion scores for incorrect predictions.
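For readers unfamiliar with the Insertion score named in the findings, the following is a minimal sketch of how such a score is commonly computed: regions are revealed in order of attributed importance, the model's confidence is recorded after each step, and the area under that curve is reported. The paper's exact evaluation protocol is not given here, so this is an assumption based on the usual definition. `insertion_auc` and `toy_confidence` are hypothetical names, and the toy confidence function stands in for a real model.

```python
def insertion_auc(order, confidence):
    """Area under the confidence curve as regions are revealed one by one.

    order:      region ids, most important first (as ranked by an attribution method)
    confidence: callable mapping a frozenset of revealed regions to a score
                (hypothetical stand-in for a model's class confidence)
    """
    curve = [confidence(frozenset())]
    revealed = []
    for region in order:
        revealed.append(region)
        curve.append(confidence(frozenset(revealed)))
    steps = len(curve) - 1
    if steps == 0:
        return curve[0]
    # Trapezoidal rule with the x-axis normalised to [0, 1].
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps)) / steps


def toy_confidence(subset):
    # Assumed toy model: each region contributes a fixed amount of confidence.
    weights = {"a": 0.5, "b": 0.25}
    return sum(weights[r] for r in subset)


good = insertion_auc(["a", "b"], toy_confidence)  # important region revealed first
bad = insertion_auc(["b", "a"], toy_confidence)   # important region revealed last
```

An ordering that reveals the most informative region first yields a larger area, which is why a higher Insertion score indicates better attribution. The Deletion score is the mirror image: regions are removed in order of importance, and a lower area is better.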
Abstract
Image attribution algorithms aim to identify important regions that are highly relevant to model decisions. Although existing attribution solutions can effectively assign importance to target elements, they still face the following challenges: 1) existing attribution methods generate inaccurate small regions thus misleading the direction of correct attribution, and 2) the model cannot produce good attribution results for samples with wrong predictions. To address the above challenges, this paper re-models the above image attribution problem as a submodular subset selection problem, aiming to enhance model interpretability using fewer regions. To address the lack of attention to local regions, we construct a novel submodular function to discover more accurate small interpretation regions. To enhance the attribution effect for all samples, we also impose four different constraints on the…
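The subset-selection framing in the abstract can be illustrated with a standard greedy procedure: starting from an empty set, repeatedly add the candidate region with the largest marginal gain. This is a minimal sketch, not the authors' implementation. The set function below is a toy coverage function chosen only because it is monotone and submodular; the paper's own function is built from its interpretability scores and is not reproduced here. All names (`greedy_select`, `coverage`, `regions`) are hypothetical.

```python
def greedy_select(candidates, score, k):
    """Greedily pick up to k elements by largest marginal gain under `score`."""
    selected = []
    remaining = list(candidates)
    current = score(frozenset())
    for _ in range(min(k, len(remaining))):
        best, best_gain = None, float("-inf")
        for c in remaining:
            # Marginal gain of adding c to the current selection.
            gain = score(frozenset(selected + [c])) - current
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Toy stand-in for image sub-regions: each region "covers" some feature ids.
regions = {"r0": {1, 2, 3}, "r1": {3, 4}, "r2": {5}, "r3": {1, 2}}


def coverage(subset):
    # Assumed toy objective: number of distinct features covered by the subset.
    covered = set()
    for r in subset:
        covered |= regions[r]
    return len(covered)


order = greedy_select(regions, coverage, k=3)
```

For monotone submodular objectives under a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is what makes the submodular formulation attractive. The order in which regions are picked also serves as an importance ranking.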
Peer Reviews
Decision: ICLR 2024 oral
1. The paper highlights the importance of fine-grained, local regions for image attribution alongside causation of erroneous predictions to image features. 2. The novel submodular function proposed in the paper has been demonstrated to outperform several State-of-the-Art (SoTA) approaches. 3. The paper also introduces several interpretability clues such as confidence $s_{conf}$, effectiveness $s_{eff.}$, consistency $s_{cons.}$ and collaboration $s_{colla.}$ to evaluate the significance of the se
1. The idea of decomposing an input image $\mathbf{I}$ into regions has been studied for several vision tasks like self-supervision (Noroozi et al., 2016 etc.), object detection (Redmon and Farhadi, 2018) etc. These should be cited and the differences should be called out. 2. The use of saliency maps $A$ in sub-region division is unclear. The paper should highlight how saliency maps are used to evaluate patch importance. 3. Most of the scoring functions like $s_{eff.}$, $s_{cons.}$ etc. rely o
- It is impressive that their method achieves an 81.0% gain in the average highest confidence score for incorrectly predicted samples. - Treating the interpretable region identification problem as a submodular subset selection problem is a novel and interesting idea. - Their method has the ability to find the reasons that cause the prediction error for incorrectly predicted images. - Adding the ablation study at the end is a good idea.
Some suggestions for minor improvements: - The phrase "... at the level of theoretical aspects." in the introduction sounds a bit too wordy. It can be expressed more concisely. - This sentence in the introduction "Image attribution algorithm is a typical interpretable method, which produces saliency maps that explain how important image regions are to model decisions." can be better phrased, in my opinion, as "... that explain which image regions are more important to model decisions." - I don't
- This paper reformulates the attribution problem as a submodular subset selection problem that achieves higher interpretability with fewer fine-grained regions. - The proposed method enables more accurate attribution and can help find the reasons for the model to produce incorrect prediction results. - It is meaningful to verify this interpretable method on face recognition and fine-grained recognition tasks, because these tasks are closer to practical applications. - The authors provide som
- In Algorithm 1, I didn't see the use of variable k. Should n of line 3 be k? - In Table 1, why are some results of LIME and Kernel Shap not reported? Is it because these attribution algorithms have limitations on the CUB data set? Hope the author can explain it. - It would be better if the authors could discuss the limitations of this method. - Can the author further state whether the proposed method is white-box based or black-box based (assuming that the calculation of a priori saliency m
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Neural Networks and Applications
Methods: Class-activation map
