SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou

TL;DR
SparseFormer introduces a sparse visual recognition model that uses limited latent tokens to efficiently process images, achieving comparable accuracy to dense models with lower computational costs and extending to video classification.
Contribution
The paper presents SparseFormer, a novel sparse neural architecture that mimics human visual sparsity by representing images with few tokens, reducing computation while maintaining performance.
Findings
Achieves ImageNet classification performance comparable to dense models.
Offers a better accuracy-throughput tradeoff.
Extensible to video classification with lower computational costs.
Abstract
Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on…
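To make the idea of sparse feature sampling with a small set of latent tokens more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the RoI layout `(cx, cy, w, h)`, the regular 7 x 7 initial placement, the number of sampling points per token, and the helper names `bilinear_sample`, `sample_token`, and `initial_rois` are all hypothetical. The learned RoI adjustment mentioned in the reviews and all transformer layers are omitted.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous coords (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so points outside the image read the nearest border value.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - fx) + feature_map[y0][x1] * fx
    bottom = feature_map[y1][x0] * (1 - fx) + feature_map[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def sample_token(feature_map, roi, points_per_side=2):
    """Sample points_per_side**2 values on a regular grid inside one token's RoI.

    roi = (cx, cy, w, h) in normalized [0, 1] image coordinates (assumed layout).
    """
    height, width = len(feature_map), len(feature_map[0])
    cx, cy, w, h = roi
    samples = []
    for i in range(points_per_side):
        for j in range(points_per_side):
            # Offsets of the grid-cell centers within the RoI, in (-0.5, 0.5).
            oy = (i + 0.5) / points_per_side - 0.5
            ox = (j + 0.5) / points_per_side - 0.5
            px = (cx + ox * w) * (width - 1)
            py = (cy + oy * h) * (height - 1)
            samples.append(bilinear_sample(feature_map, px, py))
    return samples

def initial_rois(grid=7, size=0.5):
    """Place grid * grid token RoIs on a regular lattice (assumed initialization)."""
    return [((j + 0.5) / grid, (i + 0.5) / grid, size, size)
            for i in range(grid) for j in range(grid)]

# Toy single-channel "image": the value at (row, col) is row * 10 + col.
image = [[r * 10 + c for c in range(8)] for r in range(8)]
rois = initial_rois()
tokens = [sample_token(image, roi) for roi in rois]
print(len(tokens), len(tokens[0]))  # 49 tokens, 4 sampled values each
```

The point of the sketch is that the amount of work depends on the number of tokens and sampling points (here 49 x 4 reads), not on the number of pixels in the image.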
Peer Reviews
Decision: ICLR 2024 poster
1. The proposed SparseFormer is novel and solid. 2. While maintaining performance, SparseFormer has a low memory footprint and high throughput. 3. The experiments are solid.
None
1. This paper is thoroughly motivated and exceptionally well-written. The concept of sparsifying input tokens holds paramount importance for vision transformers (ViTs) owing to the quadratic complexity with respect to sequence length in multi-head self-attention. 2. The authors have designed a functional solution, known as FocusTransformer, which improves upon the Perceiver method by introducing and dynamically adjusting regions of interest (RoIs). Experimental results compellingly demonstrate
1. While the authors have put considerable effort into elucidating the disparities between SparseFormer and Perceiver, it remains challenging for me to find a fundamental difference between these two methodologies. In my estimation, the primary distinction appears to be the introduction of the FocusTransformer. However, upon examination of this architecture, I have also observed a clear similarity to DeformableDETR. Consequently, I find it challenging to pinpoint the truly innovative contributio
1. Provides an alternative sparse paradigm for vision modeling compared to existing Transformers. Reduces computation by operating on limited tokens. 2. Token ROI adjustment mechanism is effective at focusing on foregrounds. 3. Visualizations show the model progressively focuses on discriminative regions.
1. While the paper demonstrates the effectiveness of SparseFormer on classification tasks, the reviewer has concerns about its generalization to more complex scenarios. Appendix A.1 also points out the inferior performance compared to the recent transformer network. The use of specific sparse attention patterns might limit the model's ability to capture certain types of long-range dependencies in the images for downstream tasks. 2. In addition, the reviewer also has concerns about token ROI. Ad
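One review above points to the quadratic complexity of multi-head self-attention with respect to sequence length. A rough worked example shows what that implies for the 49-token lower bound quoted in the abstract. The dense baseline of a 224 x 224 image split into 16 x 16 patches is an assumed, commonly used ViT-style setting, not a figure taken from this page, and the count covers only query-key pairs, not full FLOPs, sampling cost, or feed-forward layers.

```python
def attention_pair_count(num_tokens):
    """Number of query-key pairs in one full self-attention layer."""
    return num_tokens * num_tokens

def patch_token_count(image_size, patch_size):
    """Tokens produced by splitting a square image into non-overlapping patches."""
    per_side = image_size // patch_size
    return per_side * per_side

# Assumed dense baseline: 224 x 224 image, 16 x 16 patches.
dense_tokens = patch_token_count(image_size=224, patch_size=16)
# Lower bound on latent tokens quoted in the abstract.
sparse_tokens = 49

dense_pairs = attention_pair_count(dense_tokens)
sparse_pairs = attention_pair_count(sparse_tokens)
print(dense_tokens, dense_pairs)    # 196 38416
print(sparse_tokens, sparse_pairs)  # 49 2401
print(dense_pairs / sparse_pairs)   # 16.0
```

Under these assumptions, cutting the token count by 4x cuts the pairwise attention work by 16x, which is the scaling behind the reviewer's remark.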
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
