ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition

Jiaming Zhou; Junwei Liang; Kun-Yu Lin; Jinrui Yang; Wei-Shi Zheng

arXiv:2401.11654 · cs.CV · January 23, 2024 · 1 citation

TL;DR

This paper introduces ActionHub, a large-scale dataset of 3.6 million video descriptions covering 1,211 actions, and proposes a novel CoCo framework that leverages rich semantics for improved zero-shot action recognition.

Contribution

The paper presents ActionHub, a large-scale action video description dataset, and CoCo, a new model that improves video-text semantic alignment and learns cross-action invariance for zero-shot action recognition.

Findings

01 CoCo significantly outperforms the state of the art on ZSAR benchmarks.

02 Rich video descriptions, rather than short action names, are used for better semantic modeling.

03 The results demonstrate the effectiveness of cross-action invariance learning.

Abstract

Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to capture the rich semantics in the videos, leading to misalignment. With the intuition that video content descriptions (e.g., video captions) can provide rich contextual information about visual concepts in videos, we propose to utilize human-annotated video descriptions to enrich the semantics of the class descriptions of each action. However, all existing action video description datasets are limited in terms of the number of actions, the semantics of video descriptions, etc. To this end, we collect a large-scale action video description dataset named ActionHub, which covers a total of 1,211 common actions and…
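The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the paper's CoCo framework: it assumes video and text embeddings already live in a shared space, represents each unseen action by the mean of several description embeddings (an assumption made here for illustration), and predicts the action whose representation is most similar to the video. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_embedding(vectors):
    """Average several description embeddings into one class embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def zero_shot_predict(video_emb, class_desc_embs):
    """Pick the action whose averaged description embedding best matches the video.

    class_desc_embs maps an action name to a list of description embeddings.
    """
    scores = {
        name: cosine(video_emb, mean_embedding(embs))
        for name, embs in class_desc_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy, hand-made embeddings (hypothetical; a real system would use learned encoders).
unseen_actions = {
    "juggling": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "surfing": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
video = [0.85, 0.15, 0.05]

label, scores = zero_shot_predict(video, unseen_actions)
print(label, scores)
```

Pooling several descriptions per action, instead of embedding only the action name, is one simple way to give each class a richer text representation; the paper's actual modeling may differ.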

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications