Vision Transformer with Progressive Sampling

Xiaoyu Yue; Shuyang Sun; Zhanghui Kuang; Meng Wei; Philip Torr; Wayne Zhang; Dahua Lin

arXiv:2108.01684·cs.CV·August 5, 2021

TL;DR

This paper introduces PS-ViT, a vision transformer with a progressive sampling strategy that adaptively locates discriminative regions, significantly improving accuracy and efficiency over vanilla ViT on ImageNet.

Contribution

The paper proposes a novel differentiable progressive sampling method integrated with ViT to better locate important regions, enhancing performance and reducing computational costs.

Findings

01

PS-ViT achieves 3.8% higher top-1 accuracy than vanilla ViT on ImageNet.

02

PS-ViT uses approximately 4 times fewer parameters than vanilla ViT.

03

PS-ViT requires about 10 times fewer FLOPs than vanilla ViT.

Abstract

Transformers with powerful global relation modeling abilities have recently been introduced to fundamental computer vision tasks. As a typical example, the Vision Transformer (ViT) applies a pure transformer architecture directly to image classification, by simply splitting images into tokens of a fixed length and employing transformers to learn relations between these tokens. However, such naive tokenization could destroy object structures, assign grids to regions of no interest such as background, and introduce interference signals. To mitigate these issues, in this paper, we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive…
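
The iteration described in the abstract (sample, encode, predict offsets, move the sampling locations) can be illustrated with a small dependency-free sketch. It is not the authors' implementation: the feature map holds scalars rather than embedding vectors, `mean_mix_encoder` and `tanh_offset_head` are hypothetical stand-ins for the transformer encoder layer and the offset predictor, and the use of bilinear interpolation and a regular starting grid are assumptions made here so that fractional locations can be sampled.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2-D map (list of rows) at a fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so that out-of-range locations sample the border.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def mean_mix_encoder(tokens):
    """Hypothetical stand-in for an encoder layer: mixes in the global mean."""
    mean = sum(tokens) / len(tokens)
    return [t + mean for t in tokens]


def tanh_offset_head(tokens):
    """Hypothetical offset predictor: one bounded (dy, dx) pair per token."""
    return [(0.5 * math.tanh(t), 0.5 * math.tanh(-t)) for t in tokens]


def progressive_sample(fmap, n_side, iterations, encode, predict_offsets):
    """Run the sample -> encode -> predict offsets -> update loop."""
    h, w = len(fmap), len(fmap[0])
    # Assumption: start from the centres of a regular n_side x n_side grid.
    locations = [((i + 0.5) * h / n_side - 0.5, (j + 0.5) * w / n_side - 0.5)
                 for i in range(n_side) for j in range(n_side)]
    tokens = []
    for step in range(iterations):
        sampled = [bilinear_sample(fmap, y, x) for y, x in locations]
        tokens = encode(sampled)
        if step < iterations - 1:
            # Offsets predicted at this step move the locations for the next.
            offsets = predict_offsets(tokens)
            locations = [(y + dy, x + dx)
                         for (y, x), (dy, dx) in zip(locations, offsets)]
    return tokens, locations


# Usage: a 4x4 map, 2x2 sampling points, three sampling iterations.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens, locations = progressive_sample(
    fmap, n_side=2, iterations=3,
    encode=mean_mix_encoder, predict_offsets=tanh_offset_head)
```

Because the interpolated value varies continuously with the sampling coordinates, a loop of this shape can in principle be trained end to end, which is consistent with the differentiable sampling the paper claims. In a real model the tokens would be vectors and both stand-ins would be learned modules.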

Code & Models

Repositories

yuexy/PS-ViT
PyTorch · Official

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dropout · Dense Connections · Adam · Vision Transformer · Label Smoothing