SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?
Haruhiko Murata, Kazuhiro Hotta

TL;DR
SVD-ViT introduces a novel approach using singular value decomposition to enhance foreground focus in Vision Transformers, improving classification accuracy by suppressing background noise and artifacts.
Contribution
The paper proposes SVD-ViT, a method that uses SVD to explicitly prioritize foreground features in Vision Transformers, whose global self-attention otherwise has no mechanism for separating foreground from background.
Findings
Improves classification accuracy on benchmark datasets.
Effectively suppresses background noise and artifacts.
Enhances learning of informative foreground representations.
Abstract
Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components (the SPC module, SSVA, and ID-RSVD) and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing…
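To make the general idea of "extracting singular vectors" from patch tokens concrete, here is a minimal NumPy sketch of a rank-k SVD projection on a token matrix. It is not the paper's SPC module, SSVA, or ID-RSVD, whose details are not given in this summary. The function name `top_singular_projection`, the token shape, and the synthetic data (a rank-1 structured component plus small noise, standing in loosely for a dominant object signal) are all assumptions made for illustration.

```python
import numpy as np

def top_singular_projection(tokens, k):
    """Project (n, d) tokens onto their top-k right singular vectors.

    Hypothetical helper for illustration only; not the paper's method.
    Returns the rank-k approximation (n, d) and the k singular vectors (k, d).
    """
    # Thin SVD: tokens = U @ diag(S) @ Vt, singular values sorted descending.
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    Vk = Vt[:k]                          # top-k right singular vectors
    low_rank = (tokens @ Vk.T) @ Vk      # projection onto their span
    return low_rank, Vk

# Synthetic stand-in for ViT patch tokens (assumed shape: 196 patches, 64 dims).
rng = np.random.default_rng(0)
n, d = 196, 64
signal = np.outer(rng.normal(size=n), rng.normal(size=d))  # rank-1 structure
noise = 0.01 * rng.normal(size=(n, d))                     # small perturbation
tokens = signal + noise

low_rank, Vk = top_singular_projection(tokens, k=1)
rel_err = np.linalg.norm(low_rank - signal) / np.linalg.norm(signal)
```

In this toy setting the leading singular vector recovers the structured component and discards most of the noise. Whether the dominant singular directions of real ViT tokens correspond to foreground content is the question the paper itself investigates, so the sketch should not be read as evidence either way.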
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
