TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding

Jin-Seop Lee; SungJoon Lee; Jaehan Ahn; YunSeok Choi; Jee-Hyong Lee

arXiv:2508.07925·cs.CV·August 12, 2025

Open Access

TL;DR

This paper introduces TAG, a straightforward temporal-aware method for zero-shot video temporal grounding that improves localization accuracy by capturing temporal context and addressing similarity distortions without additional training.

Contribution

The paper presents a novel zero-shot VTG approach incorporating temporal pooling, coherence clustering, and similarity adjustment, outperforming existing methods without relying on LLMs.
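To make the three named components concrete, the following is a minimal sketch of how temporal pooling, coherence clustering, and similarity adjustment could fit together. It is an illustration only, not the authors' implementation: the function names (`temporal_pool`, `coherent_segments`, `min_max`, `ground`), the centered averaging window, the cosine threshold of 0.9, and the use of min-max rescaling as the "similarity adjustment" are all assumptions, and plain lists of floats stand in for VLM frame and text features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def temporal_pool(frames, window=3):
    """Assumed pooling: average each frame with its neighbours in a centered window."""
    half = window // 2
    pooled = []
    for i in range(len(frames)):
        chunk = frames[max(0, i - half):min(len(frames), i + half + 1)]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def coherent_segments(frames, threshold=0.9):
    """Assumed clustering: cut wherever adjacent frames fall below the threshold.

    Returns a list of (start, end) frame indices, both inclusive.
    """
    if not frames:
        return []
    segments, start = [], 0
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(frames) - 1))
    return segments


def min_max(scores):
    """Assumed adjustment: rescale scores to [0, 1] to counter a skewed range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ground(frames, query, window=3, threshold=0.9):
    """Return the (start, end) segment with the highest mean adjusted similarity."""
    pooled = temporal_pool(frames, window)
    sims = min_max([cosine(f, query) for f in pooled])
    segments = coherent_segments(pooled, threshold)
    return max(
        segments,
        key=lambda seg: sum(sims[seg[0]:seg[1] + 1]) / (seg[1] - seg[0] + 1),
    )


# Toy usage: six 2-D "frame features" and a query aligned with the last three.
frames = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
print(ground(frames, [0, 1], window=1))  # prints (3, 5)
```

With `window=1` the pooling step is a no-op, so the toy example isolates the clustering and scoring steps. Larger windows blend features across the boundary, which shows why the pooling width and the coherence threshold would need to be tuned together in any real setting.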

Findings

01 Achieves state-of-the-art results on the Charades-STA and ActivityNet Captions datasets.

02 Effectively captures temporal context and reduces semantic fragmentation.

03 Does not require training or large language models.

Abstract

Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. They also rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they depend heavily on LLMs, which require expensive inference. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition