UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Kunchang Li; Yali Wang; Yinan He; Yizhuo Li; Yi Wang; Limin Wang; Yu Qiao

arXiv:2211.09552·cs.CV·November 18, 2022·58 cites

TL;DR

UniFormerV2 enhances pretrained Vision Transformers with new local and global relation aggregators, achieving state-of-the-art video recognition performance across multiple benchmarks without complex pretraining.

Contribution

The paper introduces a generic paradigm for adapting pretrained image ViTs to video by equipping them with UniFormer-style local and global relation aggregators, improving both accuracy and efficiency on video understanding tasks.

Findings

01. Achieves 90% top-1 accuracy on Kinetics-400

02. Sets new state-of-the-art on 8 video benchmarks

03. Effectively balances accuracy and computation

Abstract

Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they are limited in tackling local video redundancy, because of the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with…
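
The idea of treating convolution and self-attention as one "relation aggregator" can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 1-D token layout, the fixed kernel weights, and the omission of learned query/key/value projections are all simplifying assumptions made for illustration. Both variants compute each output token as a weighted sum of other tokens; they differ only in how the weights (the affinity) are obtained.

```python
import math

def aggregate(affinity, values):
    """Relation aggregation: out[i] = sum_j affinity[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(a * v[d] for a, v in zip(row, values)) for d in range(dim)]
        for row in affinity
    ]

def local_affinity(num_tokens, kernel):
    """Convolution-like affinity: fixed weights over a small neighborhood.

    The weights do not depend on token content (hypothetical stand-in for
    a learned convolution kernel).
    """
    radius = len(kernel) // 2
    rows = []
    for i in range(num_tokens):
        row = [0.0] * num_tokens
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < num_tokens:  # zero padding at the borders
                row[j] = w
        rows.append(row)
    return rows

def global_affinity(tokens):
    """Attention-like affinity: softmax over scaled dot products of all pairs.

    Assumption: tokens are used directly as queries and keys, with no
    learned projections.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    rows = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Four toy tokens with two features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local_out = aggregate(local_affinity(len(tokens), [0.25, 0.5, 0.25]), tokens)
global_out = aggregate(global_affinity(tokens), tokens)
```

In this sketch the local affinity only mixes neighboring tokens, which is cheap and suited to redundant nearby content, while the global affinity compares every token with every other token, which captures long-range dependencies at a higher cost.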

Code & Models

Models

niobures/mmaction2 (model)

Taxonomy

Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Human Pose and Action Recognition

Methods: Convolution