SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

Ziteng Gao; Zhan Tong; Limin Wang; Mike Zheng Shou

arXiv:2304.03768 · cs.CV · April 10, 2023 · 6 cites

TL;DR

SparseFormer introduces a sparse visual recognition model that uses limited latent tokens to efficiently process images, achieving comparable accuracy to dense models with lower computational costs and extending to video classification.

Contribution

The paper presents SparseFormer, a novel sparse neural architecture that mimics human visual sparsity by representing images with few tokens, reducing computation while maintaining performance.
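To make the "fewer tokens, less computation" claim concrete, here is a rough back-of-the-envelope sketch. It counts only pairwise token interactions in self-attention, which grow quadratically with token count. The dense baseline (a 224x224 input cut into 16x16 patches) is an assumption chosen for illustration and is not taken from this page; the 49-token figure is the lower bound quoted in the abstract. The count ignores the cost of the sampling step and the feed-forward layers, so it is not a measurement of the model's actual FLOPs.

```python
def pairwise_interactions(num_tokens: int) -> int:
    """Number of query-key pairs a single self-attention layer scores."""
    return num_tokens * num_tokens


# Assumed dense baseline: 224x224 image, 16x16 patches (illustrative only).
dense_tokens = (224 // 16) ** 2
# Latent token count quoted in the abstract ("down to 49").
sparse_tokens = 49

ratio = pairwise_interactions(dense_tokens) / pairwise_interactions(sparse_tokens)
print(f"dense tokens:  {dense_tokens}")
print(f"sparse tokens: {sparse_tokens}")
print(f"attention-pair ratio (dense / sparse): {ratio:.1f}")
```

Under these assumptions, a 4x reduction in token count gives a 16x reduction in attention pairs per layer.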

Findings

01 Achieves ImageNet classification performance comparable to dense models.

02 Offers a better accuracy-throughput tradeoff.

03 Extensible to video classification with lower computational costs.

Abstract

Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on…
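The sparse feature sampling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(cx, cy, w, h)` region format, the 7x7 initial grid, the 36 sampling points per token, and the random offsets are all assumptions made for illustration. The sketch only shows how a small set of latent tokens, each tied to an image region, can gather features by bilinear interpolation at a few points instead of visiting every position of the feature map.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Bilinearly interpolate feat (H, W, C) at continuous pixel coords xs, ys."""
    H, W, _ = feat.shape
    xs = np.clip(xs, 0, W - 1)
    ys = np.clip(ys, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (xs - x0)[..., None]  # horizontal interpolation weight
    wy = (ys - y0)[..., None]  # vertical interpolation weight
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def sample_tokens(feat, rois, offsets):
    """Gather features for each token from points inside its region.

    rois:    (N, 4) boxes as (cx, cy, w, h), normalized to [0, 1] (assumed format).
    offsets: (N, P, 2) point offsets in [-0.5, 0.5], relative to the box size.
    Returns an array of shape (N, P, C).
    """
    H, W, _ = feat.shape
    cx, cy, w, h = (rois[:, i:i + 1] for i in range(4))  # each (N, 1)
    xs = (cx + offsets[..., 0] * w) * (W - 1)
    ys = (cy + offsets[..., 1] * h) * (H - 1)
    return bilinear_sample(feat, xs, ys)


rng = np.random.default_rng(0)
feat = rng.standard_normal((56, 56, 8))  # hypothetical early feature map

# 49 tokens initialised on a 7x7 grid of boxes (illustrative initialisation).
grid = (np.arange(7) + 0.5) / 7
cx, cy = np.meshgrid(grid, grid)
rois = np.stack(
    [cx.ravel(), cy.ravel(), np.full(49, 1 / 7), np.full(49, 1 / 7)], axis=1
)
offsets = rng.uniform(-0.5, 0.5, size=(49, 36, 2))  # 36 points per token (assumed)

sampled = sample_tokens(feat, rois, offsets)
print(sampled.shape)  # (tokens, points per token, channels)
```

In a trained model the regions and offsets would be predicted and refined by the network rather than drawn at random; the sketch fixes them only so the sampling step can be run in isolation.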

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The proposed SparseFormer is novel and solid.
2. While maintaining performance, SparseFormer has a low memory footprint and high throughput.
3. The experiments are solid.

Weaknesses

None

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. This paper is thoroughly motivated and exceptionally well-written. The concept of sparsifying input tokens holds paramount importance for vision transformers (ViTs) owing to the quadratic complexity with respect to sequence length in multi-head self-attention.
2. The authors have designed a functional solution, known as FocusTransformer, which improves upon the Perceiver method by introducing and dynamically adjusting regions of interest (RoIs). Experimental results compellingly demonstrate

Weaknesses

1. While the authors have put considerable effort into elucidating the disparities between SparseFormer and Perceiver, it remains challenging for me to find a fundamental difference between these two methodologies. In my estimation, the primary distinction appears to be the introduction of the FocusTransformer. However, upon examination of this architecture, I have also observed a clear similarity to DeformableDETR. Consequently, I find it challenging to pinpoint the truly innovative contributio

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. Provides an alternative sparse paradigm for vision modeling compared to existing Transformers. Reduces computation by operating on limited tokens.
2. Token RoI adjustment mechanism is effective at focusing on foregrounds.
3. Visualizations show the model progressively focuses on discriminative regions.

Weaknesses

1. While the paper demonstrates the effectiveness of SparseFormer on classification tasks, the reviewer has concerns about its generalization to more complex scenarios. Appendix A.1 also points out the inferior performance compared to the recent transformer network. The use of specific sparse attention patterns might limit the model's ability to capture certain types of long-range dependencies in the images for downstream tasks.
2. In addition, the reviewer also has concerns about token RoI. Ad

Code & Models

Repositories

showlab/sparseformer · PyTorch · Official

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax