Improving Zero-Shot Action Recognition using Human Instruction with Text Description

Nan Wu; Hiroshi Kera; Kazuhiko Kawamoto

arXiv:2301.08874·cs.CV·June 13, 2023

TL;DR

This paper introduces a framework that enhances zero-shot action recognition in videos by leveraging manually annotated text descriptions, improving accuracy and enabling continuous optimization.

Contribution

The paper proposes using human-annotated text descriptions to boost zero-shot action recognition performance; the approach can be combined with other models and refined iteratively.

Findings

01. The framework achieved state-of-the-art accuracy on the UCF101 and HMDB51 datasets.

02. Manual text annotations improved recognition performance.

03. The model can be iteratively optimized through human instructions.

Abstract

Zero-shot action recognition, which recognizes actions in videos without having received any training examples, is gaining wide attention because it can save labor costs and training time. Nevertheless, the performance of zero-shot learning is still unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition using human instructions with text descriptions. The framework relies on manual descriptions of video content, which incurs some labor cost; in many situations, this cost is worthwhile. We manually annotate text features for each action, each of which can be a word, phrase, or sentence. Then, by computing the matching degrees between the video and all text features, we can predict the class of the video. Furthermore, the proposed model can also be combined with other models to improve its…
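The matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are toy 3-dimensional vectors standing in for the output of some video and text encoders, cosine similarity is assumed as the "matching degree", taking the maximum over a class's descriptions is an assumed aggregation rule, and the names `cosine` and `predict` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed matching degree)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(video_vec, class_texts):
    """Score each class by its best-matching description and return the top class.

    class_texts maps a class name to a list of description embeddings
    (a word, phrase, or sentence each). Max-aggregation is an assumption.
    """
    scores = {
        label: max(cosine(video_vec, text_vec) for text_vec in text_vecs)
        for label, text_vecs in class_texts.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (hypothetical values, not produced by any real encoder).
class_texts = {
    "archery": [[1.0, 0.0, 0.0]],
    "golf swing": [[0.0, 1.0, 0.0]],
}
video_vec = [0.6, 0.7, 0.4]

label, scores = predict(video_vec, class_texts)

# A human adds one more description for "archery"; no retraining is involved,
# only the set of text features changes.
class_texts["archery"].append([0.6, 0.6, 0.5])
refined_label, refined_scores = predict(video_vec, class_texts)

print(label, refined_label)
```

In this toy setup the added description changes the predicted class, which is one way to picture how human instructions could refine predictions iteratively; the paper's actual refinement procedure is not specified in this excerpt.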

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications