TransNeXt: Robust Foveal Visual Perception for Vision Transformers
Dai Shi

TL;DR
TransNeXt introduces a novel vision transformer backbone that mimics biological foveal vision, avoiding depth degradation issues and achieving state-of-the-art performance on multiple vision tasks.
Contribution
The paper proposes Aggregated Attention and Convolutional GLU, which give each token both global and local perception without relying on layer stacking for information exchange, yielding a robust and efficient vision transformer architecture.
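As a rough illustration of what a Convolutional GLU could look like, the sketch below gates a value branch with a gate branch that first passes through a depthwise convolution. This is a minimal 1D, pure-Python sketch and not the authors' implementation: the function names (`conv_glu`, `depthwise_conv1d`), the choice of GELU as the activation, the half-and-half channel split, and the omission of the input and output linear projections are all assumptions made for illustration.

```python
import math

def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv1d(seq, kernels):
    """Apply one odd-length kernel per channel along the token axis.

    seq: list of tokens, each a list of C floats.
    kernels: list of C kernels; zero padding keeps the sequence length.
    """
    n, channels = len(seq), len(seq[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        kernel = kernels[c]
        half = len(kernel) // 2
        for i in range(n):
            acc = 0.0
            for j, w in enumerate(kernel):
                pos = i + j - half
                if 0 <= pos < n:  # positions outside the sequence count as zero
                    acc += w * seq[pos][c]
            out[i][c] = acc
    return out

def conv_glu(seq, kernels):
    """Hypothetical Convolutional GLU: gelu(depthwise_conv(gate)) * value."""
    hidden = len(seq[0]) // 2
    gate = [tok[:hidden] for tok in seq]    # first half of the channels
    value = [tok[hidden:] for tok in seq]   # second half of the channels
    gate = depthwise_conv1d(gate, kernels)  # gate sees neighbouring tokens
    return [[gelu(g) * v for g, v in zip(g_tok, v_tok)]
            for g_tok, v_tok in zip(gate, value)]
```

With an identity kernel such as `[0.0, 1.0, 0.0]` the function reduces to an ordinary GLU; any other kernel lets each token's gate depend on its neighbours, which is the local-perception aspect the name suggests.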
Findings
TransNeXt surpasses ConvNeXt on ImageNet-1K with 69% fewer parameters.
Achieves 86.2% ImageNet accuracy at 384^2 resolution.
Demonstrates superior performance on object detection and segmentation tasks.
Abstract
Due to the depth degradation effect in residual connections, many efficient Vision Transformer models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, in this paper, we propose Aggregated Attention, a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose…
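One way to picture the foveal idea described in the abstract is an attention step in which each query sees nearby tokens at full resolution and the rest of the feature map only as pooled summaries, with a single softmax over both sets. The sketch below is a 1D, pure-Python illustration under stated assumptions: the names `foveal_attention` and `avg_pool`, the `window` and `pool` parameters, the use of average pooling, and the reuse of keys as values (no learned projections, no learnable tokens) are all hypothetical simplifications, not the paper's exact design.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def avg_pool(tokens, pool):
    """Average non-overlapping groups of `pool` tokens (coarse, peripheral view)."""
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), pool):
        group = tokens[start:start + pool]
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def foveal_attention(tokens, window=1, pool=2):
    """Each query attends to fine-grained neighbours plus pooled global tokens."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    coarse = avg_pool(tokens, pool)  # coarse summary of the whole sequence
    out = []
    for i, q in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        keys = tokens[lo:hi] + coarse  # fine local keys, then coarse global keys
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)      # one softmax over both key sets
        # Keys double as values here to keep the sketch short.
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out
```

Because the pooled keys cover the whole sequence, changing a token far outside a query's window still changes that query's output within a single layer, which is the kind of global perception without layer stacking that the abstract refers to.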
Code & Models
- 🤗 DaiShiResearch/transnext-tiny-224-1k
- 🤗 DaiShiResearch/transnext-small-224-1k
- 🤗 DaiShiResearch/transnext-base-224-1k
- 🤗 DaiShiResearch/transnext-micro-224-1k
- 🤗 DaiShiResearch/transnext-small-384-1k-ft-1k
- 🤗 DaiShiResearch/transnext-base-384-1k-ft-1k
- 🤗 DaiShiResearch/transnext-micro-AAAA-256-1k
- 🤗 DaiShiResearch/dino-4scale-transnext-tiny-coco
- 🤗 DaiShiResearch/dino-5scale-transnext-tiny-coco
- 🤗 DaiShiResearch/dino-5scale-transnext-small-coco
Taxonomy
Topics: Cell Image Analysis Techniques · Visual Attention and Saliency Detection · Image Processing Techniques and Applications
