Zero-Shot Temporal Action Localization Through Textual Guidance
Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci

TL;DR
This paper introduces TEGU, a zero-shot temporal action localization method that leverages rich textual information from large language models to improve fine-grained action discrimination without training on labeled data.
Contribution
The paper proposes a novel zero-shot localization approach using textual guidance from large language models, enhancing fine-grained action discrimination without requiring annotated training data.
Findings
TEGU outperforms existing zero-shot methods on THUMOS14 and ActivityNet-v1.3.
Rich textual cues improve fine-grained action localization accuracy.
Using structured text from captions enhances discrimination between similar actions.
Abstract
Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates…
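To make the task setup described in the abstract concrete, here is a minimal sketch of a naive VLM-style baseline for ZS-TAL: score each frame against each class prompt by cosine similarity, pick the best-matching class for the video, and group consecutive high-scoring frames into segments. This is not TEGU's method, which the truncated abstract does not fully describe. The function names (`cosine_scores`, `localize`), the fixed `threshold`, and the assumption that frame and text embeddings were precomputed by some vision-language encoder are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_scores(frame_emb, text_emb):
    """Cosine similarity between T frame embeddings and C class-prompt embeddings.

    frame_emb: array of shape (T, D); text_emb: array of shape (C, D).
    Both are assumed to come from a vision-language encoder (hypothetical input).
    Returns an array of shape (T, C).
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T


def localize(frame_emb, text_emb, threshold=0.5):
    """Naive zero-shot localization (illustrative baseline, not TEGU).

    1. Choose the class whose prompt has the highest mean similarity over the video.
    2. Mark frames whose similarity to that class exceeds `threshold`.
    3. Merge runs of consecutive marked frames into (start, end) index pairs,
       with both ends inclusive.
    """
    scores = cosine_scores(frame_emb, text_emb)
    video_class = int(scores.mean(axis=0).argmax())
    active = scores[:, video_class] > threshold

    segments = []
    start = None
    for i, on in enumerate(active):
        if on and start is None:
            start = i  # a new segment begins
        elif not on and start is not None:
            segments.append((start, i - 1))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(active) - 1))  # segment runs to the last frame
    return video_class, segments
```

A single fixed threshold on frame-level similarity is exactly the kind of presence/absence decision the abstract says VLMs struggle with for fine-grained actions, which is the gap that additional textual guidance is meant to narrow.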
