Open-Vocabulary Temporal Action Localization using Multimodal Guidance

Akshita Gupta; Aditya Arora; Sanath Narayan; Salman Khan; Fahad Shahbaz Khan; Graham W. Taylor

arXiv:2406.15556·cs.CV·June 25, 2024

Open Access

TL;DR

This paper introduces OVFormer, a novel framework for open-vocabulary temporal action localization that leverages multimodal guidance and a two-stage training strategy to recognize both seen and unseen action categories in videos.

Contribution

OVFormer extends existing action localization methods with multimodal prompts, cross-attention, and a two-stage training process for open-vocabulary recognition.

Findings

1. Effective in recognizing novel categories in videos.

2. Outperforms existing methods on THUMOS14 and ActivityNet-1.3.

3. Generalizes well to unseen action classes.

Abstract

Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism…
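
The open-vocabulary setting described above, where the label set at inference can include categories never seen in training, can be illustrated with a minimal sketch. This is not OVFormer itself: it only shows the generic idea of scoring a video segment against text embeddings of class names, so that adding a novel class means adding one more embedding rather than retraining. The function names, the three-dimensional toy embeddings, and the temperature value are all assumptions made for illustration; a real system would obtain the embeddings from pretrained video and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature value is an assumption."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_segment(segment_feature, class_embeddings, temperature=0.1):
    """Score one segment feature against every class embedding (hypothetical helper)."""
    names = list(class_embeddings)
    sims = [cosine(segment_feature, class_embeddings[n]) for n in names]
    return dict(zip(names, softmax(sims, temperature)))


# Hand-made toy embeddings (assumed, not produced by any real encoder).
train_classes = {"high jump": [1.0, 0.0, 0.0], "long jump": [0.8, 0.6, 0.0]}
novel_classes = {"pole vault": [0.0, 0.0, 1.0]}  # only specified at inference

# At inference the label set is simply the union of seen and novel classes.
inference_classes = {**train_classes, **novel_classes}

segment = [0.1, 0.0, 0.9]  # toy feature for one candidate action segment
scores = classify_segment(segment, inference_classes)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the segment feature lies closest to the novel class embedding, so the novel class receives the highest score even though it was absent from `train_classes`. How well this works in practice depends entirely on the quality of the shared embedding space, which is the kind of challenge the abstract attributes to OVTAL.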

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques