Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
Mateusz Cedro, Marcin Chlebus

TL;DR
This study evaluates whether scaling up vision models improves explanation quality and finds that larger models often do not produce better localisation-based explanations, despite their higher accuracy.
Contribution
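The paper evaluates 11 models from the ResNet, DenseNet, and Vision Transformer families on three image datasets with ground-truth segmentation masks, and shows that increased model size does not reliably improve explanation quality. It also proposes a new localisation metric, Dual-Polarity Precision.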
Findings
Larger models do not significantly improve explanation metrics in most cases.
Pretraining boosts predictive accuracy but not explanation quality.
High predictive performance can occur with poor localisation explanations.
Abstract
Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation…
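To make the two localisation metrics concrete, here is a minimal NumPy sketch. `relevance_rank_accuracy` follows the usual description of Relevance Rank Accuracy: take the K highest-attributed pixels, where K is the size of the ground-truth mask, and report the fraction that fall inside the mask. `dual_polarity_precision` is only one plausible reading of the sentence above; the abstract does not give the formula, so the equal-weight average of "positive mass inside the mask" and "negative mass outside the mask" is an assumption for illustration, not the authors' definition. The function names and the toy arrays are likewise hypothetical.

```python
import numpy as np


def relevance_rank_accuracy(attribution, mask):
    """Fraction of the top-K attributed pixels inside the mask, with K = mask size."""
    attribution = np.asarray(attribution, dtype=float).ravel()
    mask = np.asarray(mask, dtype=bool).ravel()
    k = int(mask.sum())
    if k == 0:
        return float("nan")  # undefined without a ground-truth region
    # Indices of the K largest attribution values (stable sort for reproducible ties).
    top_k = np.argsort(-attribution, kind="stable")[:k]
    return float(mask[top_k].sum() / k)


def dual_polarity_precision(attribution, mask):
    """ASSUMED form, not taken from the paper: mean of
    (share of positive attribution mass inside the mask) and
    (share of negative attribution mass outside the mask)."""
    attribution = np.asarray(attribution, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    pos = np.clip(attribution, 0.0, None)   # positive evidence
    neg = np.clip(-attribution, 0.0, None)  # magnitude of negative evidence
    scores = []
    if pos.sum() > 0:
        scores.append(pos[mask].sum() / pos.sum())
    if neg.sum() > 0:
        scores.append(neg[~mask].sum() / neg.sum())
    return float(np.mean(scores)) if scores else float("nan")


# Toy 2x2 example: the object occupies the top row.
toy_attribution = np.array([[0.9, 0.8], [0.1, -0.5]])
toy_mask = np.array([[1, 1], [0, 0]])
rra = relevance_rank_accuracy(toy_attribution, toy_mask)
dpp = dual_polarity_precision(toy_attribution, toy_mask)
```

In the toy example both of the two highest attributions lie inside the mask, so the rank accuracy is 1.0. The assumed dual-polarity score is slightly lower because a small amount of positive attribution (0.1) falls outside the mask.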
