TL;DR
This paper introduces PS-ViT, a vision transformer with a progressive sampling strategy that adaptively locates discriminative regions. On ImageNet it reaches 3.8% higher top-1 accuracy than vanilla ViT with about 4 times fewer parameters and about 10 times fewer FLOPs.
Contribution
The paper proposes a novel differentiable progressive sampling method integrated with ViT to better locate important regions, enhancing performance and reducing computational costs.
Findings
PS-ViT achieves 3.8% higher top-1 accuracy than vanilla ViT on ImageNet.
PS-ViT uses approximately 4 times fewer parameters than vanilla ViT.
PS-ViT requires about 10 times fewer FLOPs than vanilla ViT.
Abstract
Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image classification, by simply splitting images into tokens with a fixed length, and employing transformers to learn relations between these tokens. However, such naive tokenization could destruct object structures, assign grids to uninterested regions such as background, and introduce interference signals. To mitigate the above issues, in this paper, we propose an iterative and progressive sampling strategy to locate discriminative regions. At each iteration, embeddings of the current sampling step are fed into a transformer encoder layer, and a group of sampling offsets is predicted to update the sampling locations for the next step. The progressive…
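The iterative update described in the abstract can be sketched as a short loop. This is a minimal illustration in plain Python, not the authors' implementation: the feature map is a single-channel 2D list, and `encoder` and `predict_offset` are hypothetical placeholders standing in for the transformer encoder layer and the offset predictor. The regular initial grid, the bilinear sampling, and the passing of the previous tokens into the encoder are assumptions made for the sketch.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the location so it stays inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def initial_grid(h, w, n):
    """Place n x n sampling points at regularly spaced cell centres (assumed start)."""
    step_y, step_x = h / n, w / n
    return [((i + 0.5) * step_y - 0.5, (j + 0.5) * step_x - 0.5)
            for i in range(n) for j in range(n)]


def progressive_sampling(feature, n, num_iters, encoder, predict_offset):
    """Iteratively sample tokens, encode them, and move the sampling locations."""
    h, w = len(feature), len(feature[0])
    locations = initial_grid(h, w, n)
    tokens = None
    for step in range(num_iters):
        # Sample one value per location at the current sampling step.
        sampled = [bilinear_sample(feature, y, x) for y, x in locations]
        # Hypothetical encoder: sees the samples, their locations, and prior tokens.
        tokens = encoder(sampled, locations, tokens)
        if step < num_iters - 1:
            # Hypothetical offset head: one (dy, dx) pair per token.
            offsets = [predict_offset(t) for t in tokens]
            locations = [(y + oy, x + ox)
                         for (y, x), (oy, ox) in zip(locations, offsets)]
    return tokens, locations


# Toy stand-ins (hypothetical): an identity "encoder" and a constant offset head.
def toy_encoder(sampled, locations, previous_tokens):
    return list(sampled)


def toy_offset_head(token):
    return (0.0, 0.25)


ramp = [[float(x) for x in range(4)] for _ in range(4)]
tokens, locations = progressive_sampling(
    ramp, n=2, num_iters=3, encoder=toy_encoder, predict_offset=toy_offset_head
)
```

In a real model the two placeholders would be learned modules and the whole loop would need to be differentiable, which is why fractional locations are read with interpolation rather than rounded to the nearest pixel.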
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dropout · Dense Connections · Adam · Vision Transformer · Label Smoothing
