UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

TL;DR
UniFormer is a transformer architecture that combines 3D convolution and self-attention to learn rich spatiotemporal features from videos efficiently, achieving high accuracy with less computation.
Contribution
This paper introduces UniFormer, a unified transformer model that effectively balances local redundancy reduction and global dependency capture in video representation learning.
Findings
Achieves state-of-the-art accuracy on Kinetics datasets.
Requires 10x fewer GFLOPs than comparable methods.
Performs well with only ImageNet-1K pretraining.
Abstract
It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution…
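The cost contrast the abstract draws between local aggregation and global self-attention can be made concrete with a rough count of token interactions. The sketch below is illustrative only and is not taken from the paper: the function names, the 3x3x3 neighborhood, and the 8x56x56 feature-map size are all assumptions, and border effects are ignored.

```python
def local_pairs(t, h, w, k=3):
    """Interactions when each token aggregates a k*k*k neighborhood.

    Border effects are ignored, so this is a slight overcount.
    """
    return t * h * w * k ** 3


def global_pairs(t, h, w):
    """Interactions when every token is compared with every token."""
    n = t * h * w
    return n * n


# Hypothetical feature-map size (frames, height, width), for illustration only.
t, h, w = 8, 56, 56

n_local = local_pairs(t, h, w)
n_global = global_pairs(t, h, w)
ratio = n_global / n_local

print(f"local:  {n_local:,} interactions")
print(f"global: {n_global:,} interactions")
print(f"global / local: {ratio:.0f}x")
```

Under these assumed sizes, comparing all tokens costs roughly three orders of magnitude more interactions than a small 3D neighborhood, but only the global comparison lets a token see the whole clip. That trade-off is what motivates combining the two.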
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Convolution · 3D Convolution
