UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

TL;DR
UniFormerV2 enhances pretrained Vision Transformers with new local and global relation aggregators, achieving state-of-the-art video recognition performance across multiple benchmarks without complex pretraining.
Contribution
It introduces a unified paradigm to adapt pretrained ViTs with novel UniFormer designs, improving accuracy and efficiency in video understanding tasks.
Findings
Achieves 90% top-1 accuracy on Kinetics-400
Sets new state-of-the-art on 8 video benchmarks
Effectively balances accuracy and computation
Abstract
Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with…
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Human Pose and Action Recognition
Methods: Convolution
