Spatio-Temporal Context Prompting for Zero-Shot Action Detection
Wei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai

TL;DR
This paper adapts pretrained visual-language models to zero-shot spatio-temporal action detection in videos, using context prompting and interest token spotting to recognize unseen action categories.
Contribution
The paper proposes a method that combines context prompting and interest token spotting to improve zero-shot action detection with pretrained visual-language models.
Findings
Outperforms previous methods on J-HMDB, UCF101-24, and AVA datasets.
Effectively recognizes unseen actions in videos.
Extends to multi-action videos for real-world applications.
Abstract
Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address…
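The abstract relies on the general idea of matching visual features against text features of action labels. The sketch below illustrates only that generic zero-shot matching step (cosine similarity followed by a softmax), not the paper's Person-Context Interaction or Context Prompting modules. The embeddings are hand-made toy vectors, and the function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(person_feature, text_features, temperature=0.07):
    """Score one person feature against every label's text feature.

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    labels = list(text_features)
    sims = [cosine(person_feature, text_features[label]) for label in labels]
    return dict(zip(labels, softmax(sims, temperature)))


# Toy stand-ins for text-encoder outputs of unseen action labels (hypothetical).
text_features = {
    "ride a bike": [0.9, 0.1, 0.0],
    "open a door": [0.1, 0.9, 0.1],
    "drink": [0.0, 0.2, 0.9],
}

# Toy stand-in for the visual feature of one detected person (hypothetical).
person_feature = [0.8, 0.2, 0.1]

probs = zero_shot_classify(person_feature, text_features)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Because labels enter only through their text features, a new action category can be added by supplying another text embedding, with no retraining of the classifier head. In a real system the vectors would come from the visual and text encoders of a pretrained visual-language model.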
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Gait Recognition and Analysis
