Open-Vocabulary Temporal Action Localization using Multimodal Guidance
Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor

TL;DR
This paper introduces OVFormer, a novel framework for open-vocabulary temporal action localization that leverages multimodal guidance and a two-stage training strategy to recognize both seen and unseen action categories in videos.
Contribution
OVFormer extends existing action localization methods with multimodal prompts, cross-attention, and a two-stage training process for open-vocabulary recognition.
Findings
Recognizes novel action categories in videos effectively.
Outperforms existing methods on THUMOS14 and ActivityNet-1.3.
Generalizes well to unseen action classes.
Abstract
Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism…
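To make the open-vocabulary idea in the abstract concrete, here is a minimal sketch of how categories can be specified at inference time: a video segment feature is scored against text embeddings of class names, so adding a novel class only means adding one more embedding. This is an illustration of the general setting, not OVFormer's actual pipeline. The function names `cosine` and `classify_open_vocab`, the class names, and the 3-dimensional embeddings are all hypothetical; real systems would obtain embeddings from pretrained video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_open_vocab(segment_feature, class_embeddings):
    """Score a segment against every class embedding and return the best match.

    class_embeddings maps a class name to its (assumed) text embedding.
    The vocabulary is just a dict, so it can be changed at inference time.
    """
    scores = {
        name: cosine(segment_feature, emb)
        for name, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy hand-made embeddings (hypothetical values, not from a real encoder).
seen = {"high jump": [1.0, 0.0, 0.0], "diving": [0.0, 1.0, 0.0]}
novel = {"pole vault": [0.7, 0.0, 0.7]}

# Feature of one video segment (also a made-up value).
segment = [0.6, 0.1, 0.8]

# With only the seen classes, the segment falls back to the closest seen class.
label_seen_only, _ = classify_open_vocab(segment, seen)

# Adding the novel class at inference requires no retraining in this sketch.
label_open, scores_open = classify_open_vocab(segment, {**seen, **novel})
```

In this toy setup the segment is assigned to "high jump" when only seen classes are available and to "pole vault" once that class is added to the vocabulary. Localization (predicting start and end times) is deliberately left out of the sketch.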
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
