Multi-Modal Few-Shot Temporal Action Detection

Sauradip Nag; Mengmeng Xu; Xiatian Zhu; Juan-Manuel Perez-Rua; Bernard Ghanem; Yi-Zhe Song; Tao Xiang

arXiv:2211.14905 · cs.CV · March 28, 2023 · 1 citation

Open Access · 1 repo

TL;DR

This paper introduces MUPPET, a multi-modal few-shot temporal action detection method that uses few-shot support videos and class names jointly, bridged through pretrained vision and language models, to improve detection on ActivityNet v1.3 and THUMOS14.

Contribution

The paper defines the multi-modality few-shot (MMFS) temporal action detection problem and proposes MUPPET, an approach that combines few-shot and zero-shot learning through multi-modal prompts and meta-learning.

Findings

01. MUPPET outperforms state-of-the-art methods on the ActivityNet v1.3 and THUMOS14 datasets.

02. MUPPET achieves state-of-the-art results in few-shot object detection on MS-COCO.

03. The method handles intra-class variation through query feature regulation.

Abstract

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a…
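
The prompt construction mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract is cut off at this point, so everything below is an assumption made for illustration and not the authors' implementation: the temporal mean pooling, the single linear projection, the feature sizes, and the names `mean_pool`, `project`, and `build_multimodal_prompt` are all hypothetical. The sketch only shows the general idea of turning each support video into a token in the text embedding space and placing it alongside the class-name tokens.

```python
import random


def mean_pool(frames):
    """Average a list of per-frame feature vectors into one vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def project(vec, weight):
    """Apply a linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def build_multimodal_prompt(support_videos, class_name_tokens, weight):
    """Return one visual token per support video, followed by the class-name tokens."""
    visual_tokens = [project(mean_pool(video), weight) for video in support_videos]
    return visual_tokens + class_name_tokens


rng = random.Random(0)
VIS_DIM, TXT_DIM = 8, 4  # hypothetical feature sizes

# Hypothetical projection from visual space to text token space;
# in a real system this mapping would be learned.
weight = [[rng.gauss(0, 0.1) for _ in range(VIS_DIM)] for _ in range(TXT_DIM)]

# Two support videos (2-shot), each with 5 frames of random placeholder features.
support_videos = [
    [[rng.random() for _ in range(VIS_DIM)] for _ in range(5)] for _ in range(2)
]

# Three placeholder embeddings standing in for the class-name tokens.
class_name_tokens = [[rng.random() for _ in range(TXT_DIM)] for _ in range(3)]

prompt = build_multimodal_prompt(support_videos, class_name_tokens, weight)
print(len(prompt), len(prompt[0]))  # 5 tokens, each of size TXT_DIM
```

In this sketch the resulting sequence has one token per support video plus the class-name tokens, all of the same dimensionality, so it could be passed to a text encoder that accepts embedded tokens. How the paper actually pools, projects, and orders these tokens is not stated in the truncated abstract.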

Code & Models

Repositories

sauradip/muppet (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications