big.LITTLE Vision Transformer for Efficient Visual Recognition
He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong

TL;DR
The paper proposes a hybrid big.LITTLE Vision Transformer architecture that dynamically allocates tokens to high-capacity or efficient models, significantly reducing computation while maintaining high accuracy in visual recognition tasks.
Contribution
It introduces a novel dual-transformer system with a dynamic inference mechanism that improves efficiency without sacrificing performance in visual recognition.
Findings
Achieves high accuracy with reduced computational load.
Effectively balances performance and efficiency in large-scale tasks.
Demonstrates success on image classification and segmentation tasks.
Abstract
In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and allocates them accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient little model. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, as it ensures that detailed analysis is reserved for the most…
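The token allocation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `route_tokens`, the helper `make_mlp`, the `keep_ratio` parameter, and the use of the token norm as an importance score are all hypothetical stand-ins, since the abstract does not say how importance is computed or how the two blocks are built.

```python
import numpy as np

def route_tokens(tokens, scores, keep_ratio, big_block, little_block):
    """Send the top-scoring tokens through big_block and the rest through little_block.

    tokens: (N, D) array, scores: (N,) array, keep_ratio: fraction routed to big_block.
    All names here are illustrative assumptions, not taken from the paper.
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    order = np.argsort(-scores, kind="stable")  # indices sorted by descending importance
    big_idx, little_idx = order[:k], order[k:]
    out = np.empty_like(tokens)
    out[big_idx] = big_block(tokens[big_idx])            # high-capacity path
    if little_idx.size:
        out[little_idx] = little_block(tokens[little_idx])  # cheap path
    return out, big_idx, little_idx

def make_mlp(rng, dim, hidden):
    """Build a residual two-layer MLP; the hidden width stands in for block capacity."""
    w1 = rng.standard_normal((dim, hidden)) / np.sqrt(dim)
    w2 = rng.standard_normal((hidden, dim)) / np.sqrt(hidden)
    def block(x):
        return x + np.maximum(x @ w1, 0.0) @ w2
    return block

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
scores = np.linalg.norm(tokens, axis=1)  # placeholder importance score (assumption)
big = make_mlp(rng, 8, 32)      # wider, more expensive block
little = make_mlp(rng, 8, 4)    # narrower, cheaper block
out, big_idx, little_idx = route_tokens(tokens, scores, 0.25, big, little)
```

With `keep_ratio=0.25`, 4 of the 16 tokens take the expensive path and the remaining 12 take the cheap one, so the per-token cost is dominated by the narrow block while the output keeps the original token order and shape.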
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. The P-E block dual-transformer design is an interesting way to leverage the varying importance of different tokens. 2. The Semi-Cross Attention mechanism allows important tokens to still gather information from all tokens.
1. It's unclear how the FLOPs reduction translates to latency/FPS improvement. 2. It's not clear how the dual transformer modules affect the self-attention mechanism. 3. The proposed method relies on distillation from pre-trained models, which adds training complexity.
1. As pointed out by the authors, improving the efficiency of ViT is an important topic, and this work proposes a new architecture to address it. 2. The ablation study on model design is comprehensive and discusses the contribution of each component.
1. The biggest concern is the need for distillation from a vanilla model, which makes the comparison to other methods unfair. Based on Table 3, without distillation it only achieves 80.8, which is worse than [email protected] in Table 1. Moreover, which vanilla model is used for the distillation is not described. 2. Following from point 1, the results in Table 2 and the description at line 404 show the significance of distillation, which makes the comparison in Table 2 not meaningful; or on
(1) The manuscript addresses the challenges associated with sub-optimal performance and computational complexity due to the omission of unrepresentative tokens in vision transformers. The motivation for the proposed approach is clear, and the methodology is intuitive and accessible. The integration of performance and efficiency blocks is straightforward and effective. (2) The manuscript is well organized, with clear logic and structure, which facilitates the reader’s understanding.
(1) While the proposed method shows notable improvement over the baseline, it does not demonstrate significant advantages in terms of performance or computational complexity relative to other SOTA approaches. This could limit the technical merits and impacts of the work. (2) The manuscript would benefit from a more in-depth exploration and discussion of the proposed methods. Additional experiments on analyzing which tokens are utilized and how to optimize the use of important tokens are suggest
1. The paper is well written and organized. 2. The proposed P-Block and E-Block are reasonable, leading to lower computational cost without compromising accuracy. 3. The experiments on image classification and SAM verify, to some extent, the effectiveness of the proposed method.
1. The actual speed-up of the proposed method (on GPU/CPU) compared to previous methods should be included. 2. Some related works on efficient SAM architectures are missing [1][2]. I think the proposed method should be compared with them. 3. The characters in the figures (such as Figure 1) should be larger for the convenience of readers. [1] TinySAM: Pushing the envelope for efficient segment anything model. [2] EfficientViT-SAM: Accelerated segment anything model without performance
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax
