Improving Zero-Shot Action Recognition using Human Instruction with Text Description
Nan Wu, Hiroshi Kera, Kazuhiko Kawamoto

TL;DR
This paper introduces a framework that enhances zero-shot action recognition in videos by leveraging manually annotated text descriptions, improving accuracy and enabling continuous optimization.
Contribution
It proposes a novel approach using human-annotated text descriptions to boost zero-shot action recognition performance, which can be combined with other models and iteratively refined.
Findings
Achieved state-of-the-art accuracy on UCF101 and HMDB51 datasets.
Demonstrated that manual text annotations improve recognition performance.
Showed that the model can be iteratively optimized through human instructions.
Abstract
Zero-shot action recognition, which recognizes actions in videos without any training examples of those actions, is gaining wide attention because it can save labor costs and training time. Nevertheless, the performance of zero-shot learning is still unsatisfactory, which limits its practical application. To address this problem, this study proposes a framework that improves zero-shot action recognition using human instructions in the form of text descriptions. The framework relies on manual descriptions of video contents, which incurs some labor cost; in many situations, that cost is worthwhile. We manually annotate text features for each action, where a text feature can be a word, phrase, or sentence. Then, by computing the matching degree between the video and every text feature, we predict the class of the video. Furthermore, the proposed model can also be combined with other models to improve its…
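The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the video and every text description have already been embedded into a shared vector space by some pretrained model, that the "matching degree" is cosine similarity, and that a class score is the best match over its descriptions. None of these choices is confirmed by the abstract, and the names `cosine` and `predict_action` as well as the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed matching degree)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_action(video_feature, class_text_features):
    """Pick the class whose text descriptions best match the video.

    class_text_features maps a class name to a list of embedded descriptions
    (each a word, phrase, or sentence in the original annotation).
    Taking the maximum over a class's descriptions is an assumption.
    """
    scores = {
        name: max(cosine(video_feature, text) for text in texts)
        for name, texts in class_text_features.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-made embeddings for illustration only (not real model outputs).
video = [0.9, 0.1, 0.0]
texts = {
    "archery": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "diving": [[0.0, 1.0, 0.0]],
}
label, scores = predict_action(video, texts)
print(label, scores)
```

Under these assumptions, refining the model through human instruction would amount to adding or editing entries in `class_text_features`, with no retraining of the embedding model.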
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications
