Zero-Shot Temporal Action Localization Through Textual Guidance

Benedetta Liberatori; Alessandro Conti; Lorenzo Vaquero; Paolo Rota; Yiming Wang; Elisa Ricci

arXiv:2605.22201·cs.CV·May 22, 2026

TL;DR

This paper introduces TEGU, a zero-shot temporal action localization method that leverages rich textual information from large language models to improve fine-grained action discrimination without training on labeled data.

Contribution

The paper proposes a novel zero-shot localization approach using textual guidance from large language models, enhancing fine-grained action discrimination without requiring annotated training data.

Findings

01. TEGU outperforms existing zero-shot methods on THUMOS14 and ActivityNet-v1.3.

02. Rich textual cues improve fine-grained action localization accuracy.

03. Using structured text from captions enhances discrimination between similar actions.

Abstract

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates…
