Vision Transformer with Sparse Scan Prior

Yuguang Zhang; Qihang Fan; Huaibo Huang

arXiv:2405.13335·cs.CV·September 11, 2025·2 cites

TL;DR

This paper introduces SSViT, a vision transformer leveraging a sparse scan prior inspired by the human eye, which reduces computational load while maintaining high accuracy across vision tasks.

Contribution

The paper proposes the Sparse Scan Self-Attention ($S^{3}A$) mechanism, which efficiently models local spatial information around predefined anchors, and builds SSViT on top of it to reach strong accuracy at lower computational cost.

Findings

01. SSViT achieves 84.4%/85.7% top-1 accuracy on ImageNet without extra data.

02. SSViT reduces FLOPs significantly compared to traditional models.

03. The approach performs well across various vision tasks and datasets.

Abstract

In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism ($S^{3}A$). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on $S^{3}A$, we introduce the Sparse Scan Vision Transformer (SSViT).…
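
The anchor-plus-local-window idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the anchor layout (a `grid` x `grid` lattice spaced `stride` apart and centred on the token), the `window` size, the single head, and the absence of learned projections are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def sparse_scan_positions(i, j, height, width, stride=4, grid=3, window=3):
    """Return the sorted unique (row, col) positions one token attends to.

    Assumed layout (not from the paper): a grid x grid lattice of anchors
    spaced `stride` apart and centred on the token, each with a
    window x window neighbourhood. Out-of-bounds positions are dropped.
    """
    g, w = grid // 2, window // 2
    positions = set()
    for ai in range(-g, g + 1):
        for aj in range(-g, g + 1):
            anchor_r, anchor_c = i + ai * stride, j + aj * stride
            for dr in range(-w, w + 1):
                for dc in range(-w, w + 1):
                    r, c = anchor_r + dr, anchor_c + dc
                    if 0 <= r < height and 0 <= c < width:
                        positions.add((r, c))
    return sorted(positions)


def sparse_scan_attention(x, stride=4, grid=3, window=3):
    """Single-head attention where each token sees only its sparse-scan set.

    x has shape (height, width, channels). Queries, keys and values are all
    x itself (no learned projections) to keep the sketch short.
    """
    height, width, channels = x.shape
    out = np.zeros(x.shape, dtype=np.float64)
    scale = 1.0 / np.sqrt(channels)
    for i in range(height):
        for j in range(width):
            pos = sparse_scan_positions(i, j, height, width, stride, grid, window)
            rows, cols = zip(*pos)
            keys = x[list(rows), list(cols)]       # (n_positions, channels)
            logits = keys @ x[i, j] * scale        # (n_positions,)
            weights = np.exp(logits - logits.max())  # stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ keys             # weighted sum of values
    return out
```

Under these assumed settings, an interior token on a 16 x 16 feature map attends to 81 positions instead of all 256, which is the kind of saving over global attention that the abstract refers to.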

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Photonic and Optical Devices · Analytical Chemistry and Sensors

Methods: Focus