ActionCLIP: A New Paradigm for Video Action Recognition

Mengmeng Wang; Jiazheng Xing; Yong Liu

arXiv:2109.08472 · cs.CV · September 20, 2021 · 189 cites

TL;DR

ActionCLIP introduces a multimodal video-text matching framework for action recognition, enabling zero-shot and few-shot learning with improved transferability and state-of-the-art accuracy on Kinetics-400.

Contribution

The paper proposes a new 'pre-train, prompt and fine-tune' paradigm for action recognition, which leverages large-scale web data and the semantic content of label texts to improve transferability.
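To make the "prompt" step concrete, here is a minimal sketch of how a class label could be expanded into natural-language sentences before being passed to a text encoder. The templates and the `build_prompts` helper are hypothetical and chosen only for illustration; this summary does not list the prompts the authors actually use.

```python
# Hypothetical prompt templates (assumption: not taken from the paper).
TEMPLATES = (
    "a video of a person {}.",
    "human action of {}.",
    "{}, a video of action.",
)


def build_prompts(label, templates=TEMPLATES):
    """Turn one class label into several natural-language sentences."""
    # Normalise dataset-style labels such as "playing_guitar".
    text = label.replace("_", " ").lower()
    return [template.format(text) for template in templates]


prompts = build_prompts("playing_guitar")
for sentence in prompts:
    print(sentence)
```

Because the label stays as text rather than being mapped to a class index, the same helper works unchanged for a category that was never seen during training.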

Findings

01 Achieves 83.8% top-1 accuracy on Kinetics-400.

02 Demonstrates superior zero-shot and few-shot transfer capabilities.

03 Outperforms existing methods in general action recognition tasks.

Abstract

The canonical approach to video action recognition treats the task as a classic 1-of-N majority vote problem for a neural model. Such models are trained to predict a fixed set of predefined categories, which limits their transferability to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameter requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal…
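The video-text matching formulation described in the abstract can be sketched as a nearest-text lookup in a shared embedding space. The example below is a simplified illustration, not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of a video encoder and a text encoder, and the function names and the temperature value of 0.07 are assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_scores(video_emb, text_embs):
    """Cosine similarity between one video embedding and each text embedding."""
    v = l2_normalize(video_emb)
    return [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_embs]


def softmax(scores, temperature=0.07):
    """Convert similarities to probabilities (temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(video_emb, label_embs):
    """Pick the label whose text embedding best matches the video embedding."""
    labels = list(label_embs)
    probs = softmax(cosine_scores(video_emb, [label_embs[name] for name in labels]))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy embeddings (assumption): real ones would come from trained encoders.
label_embs = {
    "playing guitar": [0.9, 0.1, 0.0],
    "riding a bike": [0.0, 1.0, 0.2],
    "swimming": [0.1, 0.0, 1.0],
}
video_emb = [0.8, 0.2, 0.1]

label, prob = zero_shot_classify(video_emb, label_embs)
print(label, round(prob, 4))
```

Adding a new category only requires adding another entry to `label_embs`; no classifier weights change, which is the property that makes zero-shot recognition possible in a matching setup of this kind.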

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems