Multi-modal Prompting for Low-Shot Temporal Action Localization

Chen Ju; Zeqian Li; Peisen Zhao; Ya Zhang; Xiaopeng Zhang; Qi Tian; Yanfeng Wang; Weidi Xie

arXiv:2303.11732·cs.CV·March 22, 2023·5 cites

Open Access

TL;DR

This paper introduces a Transformer-based approach for low-shot temporal action localization that leverages multi-modal prompts and improved embeddings to detect and classify actions in videos, even with limited or no training examples.

Contribution

It proposes a novel multi-modal prompting framework that aligns optical flow, RGB, and text embeddings, enhancing open-vocabulary classification in low-shot scenarios.

Findings

01 Outperforms state-of-the-art methods on the THUMOS14 and ActivityNet1.3 datasets.

02 Demonstrates the effectiveness of multi-modal embedding alignment.

03 Shows significant improvements in low-shot action localization accuracy.

Abstract

In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot and few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, even categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to complement image-text foundation models with temporal motion, we improve category-agnostic action proposal by explicitly aligning the embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., classifiers that avoid lexical ambiguities. To be specific, we propose to prompt the…
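The second stage named in the abstract, open-vocabulary classification, can be illustrated with a minimal sketch. It assumes CLIP-style matching, where each action proposal embedding is compared by cosine similarity against text embeddings of the class names. The function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper; this is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_proposals(proposal_emb, text_emb, class_names, temperature=0.07):
    """Assign each proposal the class whose text embedding is most similar.

    proposal_emb: (num_proposals, dim) array, one embedding per action proposal.
    text_emb:     (num_classes, dim) array, one embedding per class name.
    temperature:  hypothetical softmax temperature (an assumption, not from the paper).
    """
    p = l2_normalize(np.asarray(proposal_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = p @ t.T / temperature                 # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    labels = [class_names[i] for i in probs.argmax(axis=1)]
    return labels, probs


# Toy data: three class names with made-up orthogonal text embeddings.
class_names = ["high jump", "long jump", "diving"]
text_emb = np.eye(3)
proposal_emb = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.2, 0.8]])

labels, probs = classify_proposals(proposal_emb, text_emb, class_names)
print(labels)
```

Because the class set enters only through `text_emb`, adding an unseen category in this sketch means adding one more text embedding row, with no retraining of the classifier.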

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training