Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Qihang Fan, Yuang Ai, Huaibo Huang, Ran He

TL;DR
This paper introduces a simple, random token grouping strategy for Vision Transformers that outperforms or matches more complex methods across various tasks and modalities, simplifying the design of grouping strategies.
Contribution
The paper proposes a unified, simple random grouping method for vision tokens that replaces complex, carefully designed grouping strategies, demonstrating broad effectiveness.
Findings
Random grouping matches or outperforms more complex, carefully designed grouping methods.
Random grouping shows significant advantages in downstream tasks.
Random grouping is effective across multiple modalities, including vision, point clouds, and vision-language models.
Abstract
Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and then either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods? To answer them, we propose the random grouping strategy, a simple and fast way to group vision tokens at random. We validate this approach on multiple baselines, and experiments show that random grouping…
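The grouping idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which is not shown on this page. It assumes a single attention head, identity query/key/value projections, and a token count divisible by the group size. The function name `random_group_attention` and its parameters are hypothetical. Tokens are shuffled, split into equal groups, attended within each group, and then returned to their original order. With n tokens and groups of size g, the attention score matrices hold n·g entries in total instead of n².

```python
import numpy as np


def random_group_attention(x, group_size, rng):
    """Self-attention restricted to randomly formed token groups.

    x: array of shape (n_tokens, dim). n_tokens must be divisible by
    group_size (an assumption of this sketch, not of the paper).
    """
    n, d = x.shape
    if n % group_size != 0:
        raise ValueError("n_tokens must be divisible by group_size")

    # Random assignment: shuffle tokens, then cut into consecutive groups.
    perm = rng.permutation(n)
    groups = x[perm].reshape(n // group_size, group_size, d)

    # Scaled dot-product attention inside each group.
    # Identity Q/K/V projections are used here for brevity.
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = (weights @ groups).reshape(n, d)

    # Undo the shuffle so output row i corresponds to input token i.
    inverse = np.argsort(perm)
    return out[inverse]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels
    mixed = random_group_attention(tokens, group_size=4, rng=rng)
    print(mixed.shape)
```

In this sketch the permutation is redrawn on every call, so each token can meet different partners from one call to the next. Whether the paper resamples per layer, per batch, or per step is not stated in the summary above.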
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
