Multi-Modal Few-Shot Temporal Action Detection
Sauradip Nag, Mengmeng Xu, Xiatian Zhu, Juan-Manuel Perez-Rua, Bernard Ghanem, Yi-Zhe Song, Tao Xiang

TL;DR
This paper introduces a novel multi-modal few-shot temporal action detection method called MUPPET, which leverages support videos and class names using vision-language models to improve detection performance on benchmark datasets.
Contribution
The paper proposes MUPPET, a new approach that combines few-shot and zero-shot learning for temporal action detection using multi-modal prompts and meta-learning techniques.
Findings
MUPPET outperforms state-of-the-art methods on the ActivityNet v1.3 and THUMOS14 datasets.
MUPPET achieves state-of-the-art results in few-shot object detection on MS-COCO.
The method effectively handles intra-class variation through query feature regulation.
Abstract
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a…
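The idea of mapping support videos into a textual token space can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract above is truncated before the details are given. The sketch assumes that each support video is a list of per-frame feature vectors, that frames are mean-pooled, and that a single linear map (`weight`) projects the pooled feature into the token dimension. The function names `mean_pool`, `project`, and `build_prompt` are hypothetical and chosen for illustration only.

```python
import random


def mean_pool(frames):
    """Average per-frame feature vectors (T x Dv) into one Dv-dim vector."""
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]


def project(vec, weight):
    """Apply a linear map from visual space (Dv) to token space (Dt).

    `weight` is a Dv x Dt matrix stored as a list of rows (an assumption).
    """
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def build_prompt(support_videos, class_name_tokens, weight):
    """Build a multi-modal prompt: one visual token per support video,
    followed by the token embeddings of the class name."""
    visual_tokens = [project(mean_pool(v), weight) for v in support_videos]
    return visual_tokens + class_name_tokens


# Toy data: 2 support videos (5 and 7 frames), Dv = 8, Dt = 4,
# and a class name that is assumed to tokenize into 3 embeddings.
rng = random.Random(0)
dv, dt = 8, 4
support_videos = [
    [[rng.random() for _ in range(dv)] for _ in range(n_frames)]
    for n_frames in (5, 7)
]
class_name_tokens = [[rng.random() for _ in range(dt)] for _ in range(3)]
weight = [[rng.random() for _ in range(dt)] for _ in range(dv)]

prompt = build_prompt(support_videos, class_name_tokens, weight)
print(len(prompt), len(prompt[0]))
```

In a real system the resulting prompt would be fed to a pretrained text encoder, and the projection would be learned rather than random; both points are outside what this sketch shows.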
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
