Knowledge Prompting for Few-shot Action Recognition
Yuheng Shi, Xinxiao Wu, Hanxi Lin

TL;DR
This paper introduces knowledge prompting, a method that leverages external commonsense action knowledge and pre-trained vision-language models to improve few-shot action recognition in videos, achieving state-of-the-art results with minimal training overhead.
Contribution
The paper proposes a novel knowledge prompting approach that uses external action knowledge bases and pre-trained models to enhance few-shot video action recognition.
Findings
Achieves state-of-the-art performance on six benchmarks.
Reduces training overhead to 0.1% of that required by existing methods.
Effectively captures temporal evolution of actions.
Abstract
Few-shot action recognition in videos is challenging because of its lack of supervision and the difficulty of generalizing to unseen actions. To address this task, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt a powerful pre-trained vision-language model for few-shot classification. We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base. The text proposals are collected either by filling in handcrafted sentence templates with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Then we feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame, and the scores can be treated as…
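The first two steps described in the abstract (building text proposals from templates, then scoring each proposal against each frame) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the templates, the action phrases, the function names, and the toy 2-D vectors that stand in for the embeddings a pre-trained vision-language model would produce. Cosine similarity is used as a plausible matching score; the source does not state which score the authors use.

```python
import math

def fill_templates(templates, phrases):
    """Build text proposals by filling each template with each action phrase."""
    return [t.format(p) for t in templates for p in phrases]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def matching_scores(frame_embs, text_embs):
    """Return a T x K matrix: one row per frame, one score per text proposal."""
    return [[cosine(f, t) for t in text_embs] for f in frame_embs]

# Hypothetical templates and action phrases (not taken from the paper).
templates = ["a video of a person {}", "a person is {}"]
phrases = ["pouring water", "opening a door"]
proposals = fill_templates(templates, phrases)

# Toy stand-ins for encoder outputs: one vector per proposal, one per frame.
# A real pipeline would obtain these from the model's text and image encoders.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
frame_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

scores = matching_scores(frame_embs, text_embs)
for frame_index, row in enumerate(scores):
    print(frame_index, [round(s, 2) for s in row])
```

Each row of `scores` describes one frame in terms of how well every text proposal matches it, so the sequence of rows gives a per-frame view of the video that later stages could aggregate over time.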
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Hand Gesture Recognition Systems
