Co-training Transformer with Videos and Images Improves Action Recognition

Bowen Zhang; Jiahui Yu; Christopher Fifty; Wei Han; Andrew M. Dai; Ruoming Pang; Fei Sha

arXiv:2112.07175·cs.CV·December 15, 2021·31 cites

Open Access

TL;DR

This paper introduces Co-training Videos and Images for Action Recognition (CoVeR), a training paradigm that enhances video transformer models by jointly training on diverse video datasets and images, leading to improved action recognition accuracy.

Contribution

The paper proposes a novel co-training approach for video transformers that leverages both videos and images, demonstrating significant accuracy improvements across multiple datasets.

Findings

01 CoVeR improves Top-1 accuracy on Kinetics-400 by 2.4%.

02 CoVeR achieves state-of-the-art results on several action recognition datasets.

03 Joint training on diverse datasets and images enhances video transformer representations.

Abstract

In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance especially with recent transformer-based video architectures. While recently many works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach as Co-training Videos…
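
The two ideas in the abstract, mixing datasets that have different label spaces and treating an image as a single-frame video, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `image_as_video` and `co_training_stream`, the toy dataset names, and the size-proportional sampling rule are all assumptions made for illustration, and plain Python lists stand in for real tensors.

```python
import random
from typing import Dict, Iterator, List, Tuple

Frame = List[List[float]]   # toy H x W frame; real inputs would be tensors
Clip = List[Frame]          # a clip is a list of T frames
Example = Tuple[Clip, int]  # (clip, label index in that dataset's own label space)


def image_as_video(image: Frame) -> Clip:
    """Wrap an image as a clip with exactly one frame (T = 1)."""
    return [image]


def co_training_stream(
    datasets: Dict[str, List[Example]], steps: int, seed: int = 0
) -> Iterator[Tuple[str, Clip, int]]:
    """Yield (dataset_name, clip, label) for `steps` training steps.

    Assumption: a dataset is drawn with probability proportional to its size.
    The name tells a trainer which classification head and label space apply.
    """
    rng = random.Random(seed)
    names = list(datasets)
    weights = [len(datasets[name]) for name in names]
    for _ in range(steps):
        name = rng.choices(names, weights=weights, k=1)[0]
        clip, label = rng.choice(datasets[name])
        yield name, clip, label


# Hypothetical toy data: two video datasets and one image dataset,
# each keeping its own label indices.
frame = [[0.0, 1.0], [1.0, 0.0]]
toy_datasets = {
    "appearance_videos": [([frame] * 4, 0), ([frame] * 4, 1)],
    "motion_videos": [([frame] * 4, 2)],
    "images": [(image_as_video(frame), 5)],
}

for name, clip, label in co_training_stream(toy_datasets, steps=5):
    print(name, len(clip), label)
```

Returning the dataset name with each example keeps the label spaces separate, so a shared backbone could route each example to a per-dataset head. Images enter the same stream as clips of length one, so no separate image pipeline is needed in this sketch.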

Taxonomy

Topics: Human Pose and Action Recognition · Medical Imaging and Analysis · Advanced Neural Network Applications