Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, Yang Liu

TL;DR
This paper introduces a training-free approach for video temporal grounding that leverages large pre-trained models, especially large language models, to identify relevant video segments without any additional training, improving generalization across datasets.
Contribution
The paper proposes a novel training-free method that combines large visual language models and large language models to improve zero-shot video temporal grounding and handle complex event relationships.
Findings
Achieves state-of-the-art zero-shot performance on Charades-STA and ActivityNet Captions.
Demonstrates superior generalization in cross-dataset and out-of-distribution scenarios.
Effectively models event relationships and transitions without training.
Abstract
Video temporal grounding aims to identify the video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on dataset-specific training, which incurs high data collection costs, and they generalize poorly under cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use pre-trained visual language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events…
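The naive baseline mentioned in the abstract (enumerate proposals, then pick the one with the best vision-language alignment) can be sketched roughly as follows. This is only an illustration of that baseline, not the TFVTG method itself. It assumes per-frame query similarity scores have already been computed by some VLM. The function names and the inside-minus-outside scoring rule are hypothetical choices made for this example and do not come from the paper.

```python
from typing import List, Tuple


def enumerate_proposals(num_frames: int, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return every [start, end) frame window with at least min_len frames."""
    return [
        (start, end)
        for start in range(num_frames)
        for end in range(start + min_len, num_frames + 1)
    ]


def score_proposal(sims: List[float], start: int, end: int) -> float:
    """Assumed scoring rule: mean similarity inside the window minus mean outside."""
    inside = sims[start:end]
    outside = sims[:start] + sims[end:]
    inside_mean = sum(inside) / len(inside)
    # If the window covers the whole video there is nothing outside it.
    outside_mean = sum(outside) / len(outside) if outside else 0.0
    return inside_mean - outside_mean


def ground_query(sims: List[float], min_len: int = 1) -> Tuple[int, int]:
    """Pick the proposal whose frames align best with the query."""
    proposals = enumerate_proposals(len(sims), min_len)
    return max(proposals, key=lambda p: score_proposal(sims, p[0], p[1]))


# Hypothetical frame-to-query similarities (in practice these would come from a VLM).
frame_sims = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
best_start, best_end = ground_query(frame_sims)
```

Exhaustive enumeration is quadratic in the number of frames, so a sketch like this is only practical on sparsely sampled frames. It also scores each window independently, which is where the difficulty with multiple events and their temporal boundaries noted in the abstract arises.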
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
