Multi-modal Prompting for Low-Shot Temporal Action Localization
Chen Ju, Zeqian Li, Peisen Zhao, Ya Zhang, Xiaopeng Zhang, Qi Tian, Yanfeng Wang, Weidi Xie

TL;DR
This paper introduces a Transformer-based approach for low-shot temporal action localization that leverages multi-modal prompts and improved embeddings to detect and classify actions in videos, even with limited or no training examples.
Contribution
It proposes a novel multi-modal prompting framework that aligns optical flow, RGB, and text embeddings, enhancing open-vocabulary classification in low-shot scenarios.
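As a rough illustration of what aligning three embedding spaces can look like, the sketch below applies a generic CLIP-style symmetric contrastive (InfoNCE) loss to each pair of modalities. This summary does not state the paper's actual objective, so the loss form, the function names `info_nce` and `alignment_loss`, the equal pair weighting, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match targets[i].

    Hypothetical helper: a generic contrastive loss, not the paper's objective.
    """
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidate targets.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def alignment_loss(rgb, flow, text):
    """Average symmetric InfoNCE over the three modality pairs (assumed weighting)."""
    pairs = [(rgb, text), (flow, text), (rgb, flow)]
    total = sum(info_nce(x, y) + info_nce(y, x) for x, y in pairs)
    return total / (2 * len(pairs))


# Toy check: matching embeddings across modalities give a lower loss
# than a batch in which the flow embeddings are swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = alignment_loss(aligned, aligned, aligned)
loss_swapped = alignment_loss(aligned, swapped, aligned)
```

In this toy setup each row is one video clip, and the loss falls as the RGB, optical-flow, and text embeddings of the same clip move closer together relative to those of other clips.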
Findings
Outperforms state-of-the-art methods on THUMOS14 and ActivityNet1.3 datasets.
Demonstrates the effectiveness of multi-modal embedding alignment.
Shows significant improvements in low-shot action localization accuracy.
Abstract
In this paper, we consider the problem of temporal action localization under low-shot (zero-shot and few-shot) scenarios, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, including categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate for the lack of temporal motion in image-text foundation models, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., classifiers that avoid lexical ambiguities. To be specific, we propose to prompt the…
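The two-stage design described in the abstract can be pictured with a minimal sketch: stage one yields class-agnostic proposals (a time span plus an embedding), and stage two labels each proposal by comparing its embedding with one text embedding per category name. Everything concrete here is an assumption for illustration, including the function `classify_proposals`, the toy three-dimensional embeddings, the cosine-similarity scoring, and the temperature value; the paper's own classifier construction is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_proposals(proposals, class_embeddings, temperature=0.07):
    """Assign an open-vocabulary label to each class-agnostic proposal.

    proposals: list of (start_sec, end_sec, embedding) from a proposal stage.
    class_embeddings: dict mapping a category name to its text embedding.
    Hypothetical interface, not the paper's implementation.
    """
    names = list(class_embeddings)
    detections = []
    for start, end, embedding in proposals:
        logits = [cosine(embedding, class_embeddings[n]) / temperature for n in names]
        probs = softmax(logits)
        best = max(range(len(names)), key=probs.__getitem__)
        detections.append((start, end, names[best], probs[best]))
    return detections


# Toy inputs: two proposals and two category names unseen by the proposal stage.
proposals = [
    (2.0, 5.5, [0.9, 0.1, 0.0]),
    (10.0, 14.0, [0.1, 0.8, 0.2]),
]
class_embeddings = {
    "high jump": [1.0, 0.0, 0.0],
    "pole vault": [0.0, 1.0, 0.0],
}
detections = classify_proposals(proposals, class_embeddings)
```

Because the category set enters only through `class_embeddings`, new categories can be added at inference time by supplying more text embeddings, which is what makes this style of classifier open-vocabulary.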
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
