Vision Transformer with Sparse Scan Prior
Yuguang Zhang, Qihang Fan, Huaibo Huang

TL;DR
This paper introduces SSViT, a vision transformer leveraging a sparse scan prior inspired by the human eye, which reduces computational load while maintaining high accuracy across vision tasks.
Contribution
The paper proposes the S^3A mechanism that models local spatial information efficiently, leading to the development of SSViT with superior performance and lower computational costs.
Findings
SSViT achieves 84.4%/85.7% top-1 accuracy on ImageNet without extra data.
SSViT reduces FLOPs significantly compared to traditional models.
The approach performs well across various vision tasks and datasets.
Abstract
In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism (S^3A). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on S^3A, we introduce the Sparse Scan Vision Transformer (SSViT).…
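The anchor-plus-local-window idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the anchor layout (a square grid of anchors centred on the token), the parameters `stride`, `radius`, and `window`, and all function names are assumptions made for illustration only. The sketch computes which positions a single token would attend to, then runs ordinary softmax attention over just those positions.

```python
import math

def anchor_positions(i, j, height, width, stride, radius):
    """Hypothetical anchor layout: a (2*radius+1)^2 grid centred on token (i, j),
    with anchors spaced `stride` apart. Anchors outside the feature map are dropped."""
    anchors = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            ai, aj = i + di * stride, j + dj * stride
            if 0 <= ai < height and 0 <= aj < width:
                anchors.append((ai, aj))
    return anchors

def scanned_positions(i, j, height, width, stride=4, radius=1, window=1):
    """Union of the (2*window+1)^2 local windows around each anchor, clipped to the map."""
    seen = set()
    for ai, aj in anchor_positions(i, j, height, width, stride, radius):
        for wi in range(ai - window, ai + window + 1):
            for wj in range(aj - window, aj + window + 1):
                if 0 <= wi < height and 0 <= wj < width:
                    seen.add((wi, wj))
    return sorted(seen)

def sparse_scan_attention(query, keys, values, positions):
    """Scaled dot-product attention for one query, restricted to `positions`.
    `keys` and `values` map (row, col) -> list of floats."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[p])) for p in positions]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[positions[0]])
    return [sum(w * values[p][d] for w, p in zip(weights, positions)) / total
            for d in range(dim)]

# Compare the number of attended positions with global attention on a 16x16 map.
H = W = 16
positions = scanned_positions(8, 8, H, W)
print(len(positions), "attended positions versus", H * W, "for global attention")
```

Under these assumed settings each token attends to a fixed, small set of positions regardless of the feature-map size, which is where the reduction in computation relative to global attention would come from.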
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Photonic and Optical Devices · Analytical Chemistry and Sensors
Methods: Focus
