TL;DR
This paper introduces a bootstrap-based regularization method for attention scores in Vision Transformers, improving interpretability by reducing noise and spurious attention in image analysis.
Contribution
It proposes a novel statistical framework using bootstrapping to quantify uncertainty and regularize attention scores in ViT, enhancing explanation clarity.
Findings
Significant reduction of noisy and spurious attention in natural and medical images.
Improved sparsity and interpretability of attention maps.
Quantitative validation on simulated and real datasets confirms effectiveness.
Abstract
Vision transformers (ViT) rely on the attention mechanism to weigh input features, and attention scores have therefore naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffuse attention maps and limiting interpretability. Can we quantify the uncertainty of attention scores and obtain regularized attention scores? To this end, we consider the attention scores of ViT in a statistical framework in which independent noise leads to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. This bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical…
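The abstract's core idea (resample input features to build a baseline distribution, then test each attention score against it) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `bootstrap_regularized_attention`, the single-query, single-head setup, the column-wise resampling scheme, the comparison on the pre-softmax scale, and the threshold `alpha` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def bootstrap_regularized_attention(query, keys, n_boot=200, alpha=0.05, seed=0):
    """Hypothetical sketch: query has shape (d,), keys has shape (n_patches, d)."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    logits = keys @ query / np.sqrt(d)   # scaled dot-product logits
    scores = softmax(logits)             # ordinary attention scores

    # Assumed null model: resample every feature column independently, with
    # replacement, across patches. Each synthetic key keeps the marginal
    # feature distributions but loses patch-specific structure.
    idx = rng.integers(0, n, size=(n_boot, n, d))
    boot_keys = keys[idx, np.arange(d)]              # (n_boot, n, d)
    null_logits = (boot_keys @ query / np.sqrt(d)).ravel()

    # Empirical one-sided p-value per patch: share of null logits that are at
    # least as large as the observed logit (with a +1 correction).
    null_sorted = np.sort(null_logits)
    n_ge = null_sorted.size - np.searchsorted(null_sorted, logits, side="left")
    p_values = (1 + n_ge) / (1 + null_sorted.size)

    # Zero out non-significant scores and renormalize the survivors.
    keep = p_values <= alpha
    regularized = np.where(keep, scores, 0.0)
    if keep.any():
        regularized = regularized / regularized.sum()
    return scores, p_values, regularized

# Toy data: 50 noise patches, one of which (index 7) is aligned with the query.
rng = np.random.default_rng(1)
query = rng.normal(size=16)
keys = rng.normal(size=(50, 16))
keys[7] = 3.0 * query
scores, p_values, regularized = bootstrap_regularized_attention(query, keys)
print("non-zero scores before:", int((scores > 0).sum()))
print("non-zero scores after: ", int((regularized > 0).sum()))
```

In this toy setup every raw score is non-zero, while the regularized map keeps only patches whose logits exceed most of the bootstrap baseline. The abstract also mentions posterior probabilities, which this sketch does not attempt to estimate.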
