UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

TL;DR
UniFormer unifies convolution and self-attention in a transformer-based architecture, effectively capturing local redundancy and global dependency for diverse visual recognition tasks with state-of-the-art results.
Contribution
The paper introduces UniFormer, a novel transformer backbone that integrates convolution and self-attention, addressing redundancy and dependency issues in visual data.
Findings
Achieves 86.3% top-1 accuracy on ImageNet-1K without extra data.
State-of-the-art performance on multiple downstream tasks.
Builds an efficient version with 2-4x higher throughput.
Abstract
It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our…
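The contrast the abstract draws between the two kinds of token mixing can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not UniFormer's relation aggregator, whose definition is cut off in the excerpt above. The function names (local_aggregate, global_aggregate, comparisons), the uniform window weights, and the projection-free attention are all hypothetical simplifications chosen for the example.

```python
import math

def local_aggregate(tokens, radius=1):
    """Convolution-style mixing: each token is averaged with neighbors in a
    small window. Uniform weights are an assumption; a CNN would learn them."""
    n, dim = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        window = tokens[max(0, i - radius):min(n, i + radius + 1)]
        out.append([sum(t[d] for t in window) / len(window) for d in range(dim)])
    return out

def global_aggregate(tokens):
    """Self-attention-style mixing: each token is compared with every token
    via a softmax over scaled dot products (no learned projections here)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * t[d] for w, t in zip(weights, tokens))
                    for d in range(dim)])
    return out

def comparisons(n, radius=None):
    """Number of token pairs examined: all n*n pairs for global mixing,
    only the in-window pairs for local mixing."""
    if radius is None:
        return n * n
    return sum(min(n, i + radius + 1) - max(0, i - radius) for i in range(n))

if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(local_aggregate(toy_tokens, radius=1))
    print(global_aggregate(toy_tokens))
    print(comparisons(196), comparisons(196, radius=1))
```

The comparison count grows quadratically with the number of tokens for the global version and only linearly for the windowed one, which is the efficiency gap the abstract attributes to comparing all tokens against each other.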
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Convolution
