UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Kunchang Li; Yali Wang; Junhao Zhang; Peng Gao; Guanglu Song; Yu Liu; Hongsheng Li; Yu Qiao

arXiv:2201.09450·cs.CV·October 28, 2024·24 cites

TL;DR

UniFormer unifies convolution and self-attention in a transformer-based architecture, effectively capturing local redundancy and global dependency for diverse visual recognition tasks with state-of-the-art results.

Contribution

The paper introduces UniFormer, a novel transformer backbone that integrates convolution and self-attention, addressing redundancy and dependency issues in visual data.

Findings

01. Achieves 86.3% top-1 accuracy on ImageNet-1K without extra data.

02. State-of-the-art performance on multiple downstream tasks.

03. Builds an efficient version with 2-4x higher throughput.

Abstract

Learning discriminative representations from images and videos is challenging because of the large local redundancy and complex global dependency in visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. CNNs efficiently reduce local redundancy by convolving within a small neighborhood, but their limited receptive field makes it hard to capture global dependency. ViTs, in contrast, effectively capture long-range dependency via self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Convolution