TL;DR
This paper introduces the Jumbo token, a wider global token for Vision Transformers (ViTs). Narrowing the patch tokens while widening the global token makes ViTs faster and more accurate, and the design stays compatible with existing ViT methods.
Contribution
The paper proposes a simple, scalable method to make ViTs faster: it adds a global Jumbo token that is processed by its own efficient FFN, whose parameters are shared across all layers. The method stays compatible with existing ViT techniques and improves performance across tasks.
Findings
Jumbo tokens improve ViT speed and accuracy on ImageNet-1K.
Jumbo models outperform specialized non-ViT models in speed-accuracy trade-offs.
Jumbo improves results on segmentation, pre-training, and time series modeling.
Abstract
ViTs are general and accurate, and address many tasks, but ViTs are slow, and are not always practical when efficiency is key. Existing methods for faster ViTs design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. Many non-ViT architectures are both fast and accurate. Yet, without significant modifications, they cannot do what ViTs can: process other input shapes, pre-train by SOTA self-supervised learning, reduce computation by dropping tokens, and more. We make ViTs faster by reducing patch token width while increasing global token width by adding a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet our Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, our Jumbo is attention-only and…
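The abstract's two efficiency claims (the Jumbo FFN is fast because it processes a single token, and light on memory because its parameters are shared across layers) can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the patch count, patch width, Jumbo width multiplier, depth, and FFN expansion ratio are assumed values chosen for the example, not figures taken from the paper, and the helper names `ffn_macs` and `ffn_params` are hypothetical.

```python
# Rough cost comparison for a per-patch FFN vs. a single wide "Jumbo" FFN.
# All sizes below are ASSUMPTIONS for illustration, not values from the paper.

def ffn_macs(num_tokens: int, width: int, expansion: int = 4) -> int:
    """Multiply-accumulates of a two-layer FFN (width -> expansion*width -> width)."""
    hidden = expansion * width
    return num_tokens * 2 * width * hidden


def ffn_params(width: int, expansion: int = 4) -> int:
    """Weight count of the same two-layer FFN (biases ignored)."""
    return 2 * width * expansion * width


num_patches = 196        # assumed: 14 x 14 patch grid
patch_width = 192        # assumed: narrow patch tokens
jumbo_multiplier = 6     # assumed: Jumbo token is 6x wider than a patch token
depth = 12               # assumed: number of transformer layers

jumbo_width = jumbo_multiplier * patch_width

# Per-layer compute: many narrow tokens vs. one wide token.
patch_ffn_macs = ffn_macs(num_patches, patch_width)
jumbo_ffn_macs = ffn_macs(1, jumbo_width)

# Memory: the Jumbo FFN weights stored once vs. once per layer.
jumbo_params_shared = ffn_params(jumbo_width)
jumbo_params_unshared = depth * ffn_params(jumbo_width)

print(f"patch FFN MACs per layer: {patch_ffn_macs:,}")
print(f"Jumbo FFN MACs per layer: {jumbo_ffn_macs:,}")
print(f"Jumbo FFN params, shared across layers: {jumbo_params_shared:,}")
print(f"Jumbo FFN params, one copy per layer:   {jumbo_params_unshared:,}")
```

Under these assumed sizes, the Jumbo FFN costs roughly 18% of the patch FFN's compute per layer, because its cost scales with the square of the width multiplier while the patch FFN's cost scales with the number of patches. Sharing the weights across layers cuts the Jumbo FFN's parameter count by a factor equal to the depth.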
