TransNeXt: Robust Foveal Visual Perception for Vision Transformers

Dai Shi

arXiv:2311.17132 · cs.CV · April 23, 2024 · 20 citations

TL;DR

TransNeXt introduces a novel vision transformer backbone that mimics biological foveal vision, avoiding depth degradation issues and achieving state-of-the-art performance on multiple vision tasks.

Contribution

The paper proposes Aggregated Attention and Convolutional GLU, enabling effective global and local perception without stacking layers, leading to a robust and efficient vision transformer architecture.
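
As a rough illustration of the gating idea behind a Convolutional GLU, the following is a minimal sketch, not the paper's implementation. It assumes the gate branch passes through a depthwise convolution and a GELU before multiplying the value branch elementwise, so each token's channel gate depends on its nearest neighbors. The 1D token layout, the function names (`depthwise_conv1d`, `conv_glu`), and the omission of the input and output linear projections are simplifications introduced here.

```python
import math


def gelu(x):
    # Exact GELU expressed through the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def depthwise_conv1d(seq, kernels):
    """Per-channel 1D convolution with zero padding.

    seq: list of N tokens, each a list of C floats.
    kernels: list of C kernels, each a list of K floats (K odd).
    """
    n, c = len(seq), len(seq[0])
    k = len(kernels[0])
    half = k // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            acc = 0.0
            for j in range(k):
                src = i + j - half
                if 0 <= src < n:  # positions outside the sequence count as zero
                    acc += kernels[ch][j] * seq[src][ch]
            out[i][ch] = acc
    return out


def conv_glu(gate, value, kernels):
    """Gate each value channel with GELU(depthwise_conv(gate)).

    Hypothetical simplification: the linear projections that would normally
    produce `gate` and `value`, and the output projection, are left out.
    """
    mixed = depthwise_conv1d(gate, kernels)
    return [
        [gelu(g) * v for g, v in zip(g_row, v_row)]
        for g_row, v_row in zip(mixed, value)
    ]
```

With an identity kernel such as `[0.0, 1.0, 0.0]` the convolution is a no-op and the block reduces to a plain GLU-style gate; a wider kernel is what lets neighboring tokens influence the gate.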

Findings

01. TransNeXt-Tiny surpasses ConvNeXt-B on ImageNet-1K with 69% fewer parameters.

02. TransNeXt-Base achieves 86.2% ImageNet accuracy at 384^2 resolution.

03. Performs strongly on object detection and segmentation tasks.

Abstract

Due to the depth degradation effect in residual connections, many efficient Vision Transformer models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, we propose Aggregated Attention, a biomimetic token mixer that simulates biological foveal vision and continuous eye movement while giving each token on the feature map global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose…
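
The foveal idea in the abstract can be sketched on a toy 1D sequence. This is a minimal illustration under stated assumptions, not the paper's Aggregated Attention: each query attends to fine-grained keys in a small local window and to coarse, mean-pooled keys covering the whole sequence, with both sets sharing one softmax. Queries, keys, and values are all the raw tokens (no projections), and the `learned` argument is a hypothetical stand-in for learnable tokens whose dot product with the query is added to the local-window logits. The names `foveal_attention`, `mean_pool`, `radius`, and `pool` are introduced here for illustration.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens, size):
    """Average consecutive groups of `size` tokens into coarse tokens."""
    pooled = []
    for start in range(0, len(tokens), size):
        group = tokens[start:start + size]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def foveal_attention(tokens, radius=1, pool=2, learned=None):
    """Fine local keys plus coarse global keys under a single softmax.

    learned: optional list of 2 * radius + 1 vectors (hypothetical learnable
    tokens), one per window offset, added to the local logits as q . t.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    coarse = mean_pool(tokens, pool)
    out = []
    for i, q in enumerate(tokens):
        keys, logits = [], []
        # Fine-grained keys: the query's immediate neighborhood.
        for off in range(-radius, radius + 1):
            j = i + off
            if 0 <= j < len(tokens):
                logit = dot(q, tokens[j]) * scale
                if learned is not None:
                    logit += dot(q, learned[off + radius])
                keys.append(tokens[j])
                logits.append(logit)
        # Coarse keys: pooled summaries of the entire sequence.
        for c in coarse:
            keys.append(c)
            logits.append(dot(q, c) * scale)
        weights = softmax(logits)
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out
```

Because the pooled keys span the whole sequence, a change to a distant token alters every output within a single layer, which is the property the abstract contrasts with designs that depend on stacking layers for information exchange.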

Taxonomy

Topics: Cell Image Analysis Techniques · Visual Attention and Saliency Detection · Image Processing Techniques and Applications