UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Kunchang Li; Yali Wang; Peng Gao; Guanglu Song; Yu Liu; Hongsheng Li; Yu Qiao

arXiv:2201.04676·cs.CV·February 9, 2022·108 cites

TL;DR

UniFormer is a transformer architecture that combines 3D convolution and self-attention to efficiently learn rich spatiotemporal features from videos, achieving high accuracy with less computation.

Contribution

This paper introduces UniFormer, a unified transformer model that effectively balances local redundancy reduction and global dependency capture in video representation learning.

Findings

01. Achieves state-of-the-art accuracy on Kinetics datasets.

02. Requires 10x fewer GFLOPs than comparable methods.

03. Performs well with only ImageNet-1K pretraining.

Abstract

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution…
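
The trade-off described in the abstract can be made concrete with a rough count of how many token pairs each operation touches. The sketch below is illustrative only: the function names, the 8 x 14 x 14 token grid, and the 3 x 3 x 3 neighborhood are assumptions chosen for the example, not values taken from the paper, and the counts are token-pair comparisons rather than measured GFLOPs. It assumes odd neighborhood sizes.

```python
def axis_pairs(n, k):
    """Count ordered index pairs (i, j) on an axis of length n with |i - j| <= k // 2.

    Borders are handled exactly: tokens near an edge have fewer neighbors.
    Assumes k is odd so the neighborhood is centered on the token.
    """
    r = k // 2
    return sum(min(n - 1, i + r) - max(0, i - r) + 1 for i in range(n))


def global_attention_pairs(t, h, w):
    """Pairs compared when every token attends to every token in a T x H x W grid."""
    n = t * h * w
    return n * n


def local_neighborhood_pairs(t, h, w, kt=3, kh=3, kw=3):
    """Pairs touched when each token aggregates only a kt x kh x kw neighborhood.

    The neighborhood is a product of per-axis windows, so the total is the
    product of the per-axis pair counts.
    """
    return axis_pairs(t, kt) * axis_pairs(h, kh) * axis_pairs(w, kw)


if __name__ == "__main__":
    # Hypothetical token grid: 8 frames of 14 x 14 spatial tokens.
    t, h, w = 8, 14, 14
    global_pairs = global_attention_pairs(t, h, w)
    local_pairs = local_neighborhood_pairs(t, h, w)
    print("global pairs:", global_pairs)
    print("local pairs: ", local_pairs)
    print("ratio:       ", round(global_pairs / local_pairs, 1))
```

For this illustrative grid, global comparison touches 2,458,624 token pairs while the local neighborhood touches 35,200. Global comparison grows quadratically with the number of tokens, whereas local aggregation grows linearly, but only the global variant lets every token see every other token.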

Code & Models

Models

Sense-X/uniformer_video (Hugging Face model, ♡ 8)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Convolution · 3D Convolution