Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng; Xinhao Cai; Qingchao Chen; Yuxin Peng; Yang Liu

arXiv:2408.16219·cs.CV·August 30, 2024

TL;DR

This paper introduces a training-free approach for video temporal grounding that leverages large pre-trained models, especially large language models, to identify relevant video segments without any additional training, improving generalization across datasets.

Contribution

The paper proposes a novel training-free method that combines large visual language models and large language models to improve zero-shot video temporal grounding and handle complex event relationships.

Findings

01

Achieves state-of-the-art zero-shot performance on Charades-STA and ActivityNet Captions.

02

Demonstrates superior generalization in cross-dataset and out-of-distribution scenarios.

03

Effectively models event relationships and transitions without training.

Abstract

Video temporal grounding aims to identify the segments of an untrimmed video that are most relevant to a given natural language query. Existing video temporal localization models are trained on specific datasets, which are costly to collect, yet they generalize poorly in cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the capabilities of large pre-trained models. A naive baseline is to enumerate proposals in the video and use pre-trained visual language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events…
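
The naive baseline mentioned in the abstract (enumerate proposals, then pick the one with the best vision-language alignment) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-frame query-similarity scores have already been computed by some VLM, and the function names, the mean-score ranking rule, and the `min_len` and `stride` parameters are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Tuple


def enumerate_proposals(
    num_frames: int, min_len: int, stride: int = 1
) -> List[Tuple[int, int]]:
    """List every [start, end) frame window spanning at least min_len frames."""
    return [
        (start, end)
        for start in range(0, num_frames - min_len + 1, stride)
        for end in range(start + min_len, num_frames + 1, stride)
    ]


def best_proposal(
    frame_scores: List[float], min_len: int = 2, stride: int = 1
) -> Tuple[Optional[Tuple[int, int]], float]:
    """Pick the window with the highest mean per-frame similarity.

    frame_scores is assumed to hold one query-frame alignment score per
    sampled frame (hypothetical input, e.g. cosine similarities from a VLM).
    Returns (None, -inf) when the video is shorter than min_len frames.
    """
    best: Optional[Tuple[int, int]] = None
    best_score = float("-inf")
    for start, end in enumerate_proposals(len(frame_scores), min_len, stride):
        score = sum(frame_scores[start:end]) / (end - start)
        if score > best_score:  # ties keep the earliest window
            best, best_score = (start, end), score
    return best, best_score


# Toy scores: frames 2-4 match the query far better than the rest.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2]
window, score = best_proposal(scores, min_len=2)
print(window, round(score, 2))
```

Ranking by mean score alone treats each window in isolation, which is one way to see the limitation the abstract raises: nothing in this scoring rule models how multiple events relate to one another or where one event ends and the next begins.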

Code & Models

Repositories

minghangz/tfvtg
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications