VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Yongxin Guo; Jingyu Liu; Mingda Li; Dingxin Cheng; Xiaoying Tang; Dianbo Sui; Qingbin Liu; Xi Chen; Kevin Zhao

arXiv:2405.13382 · cs.CV · February 4, 2025 · 3 citations

Open Access 1 Repo 1 Models 2 Datasets

TL;DR

This paper introduces VTG-LLM, a novel model that enhances video large language models' ability to accurately localize event timestamps in videos by integrating timestamp knowledge and employing efficient token compression, supported by a new dataset.

Contribution

We propose VTG-LLM, a model that incorporates timestamp knowledge into video LLMs, and introduce a new dataset, VTG-IT-120K, for improved video temporal grounding.

Findings

01. VTG-LLM outperforms existing methods in VTG tasks.

02. Effective timestamp knowledge integration improves localization accuracy.

03. The new dataset enhances annotation quality for VTG research.

Abstract

Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries, significantly impacting downstream tasks like video browsing and editing. Unlike traditional task-specific models, Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. Consequently, exploring the application of video LLMs for VTG tasks has become a burgeoning research area. However, despite considerable advancements in video content understanding, video LLMs often struggle to accurately pinpoint timestamps within videos, limiting their effectiveness in VTG tasks. To address this, we introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities. Our approach includes: (1) effectively integrating timestamp knowledge into visual tokens; (2) incorporating absolute-time tokens to manage…
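To make "accurately pinpoint event timestamps" concrete, the sketch below compares a predicted time span against an annotated one using temporal intersection-over-union, a measure commonly used for span localization. The abstract as shown here does not name a metric, so treating IoU as the accuracy measure is an assumption; the `Span` class, the `temporal_iou` function, and the example query and timestamps are all hypothetical and are not taken from the VTG-LLM code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A predicted or annotated event interval, in seconds (hypothetical type)."""
    start: float
    end: float


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection-over-union of two time intervals; 0.0 when they do not overlap."""
    intersection = max(0.0, min(pred.end, truth.end) - max(pred.start, truth.start))
    union = (pred.end - pred.start) + (truth.end - truth.start) - intersection
    return intersection / union if union > 0 else 0.0


# Illustrative values only: a query such as "the person opens the fridge",
# annotated at 10-20 s, with a model answer of 15-25 s.
truth = Span(10.0, 20.0)
pred = Span(15.0, 25.0)
print(f"temporal IoU = {temporal_iou(pred, truth):.3f}")
```

A prediction that overlaps the annotation for 5 of the 15 seconds covered by either span scores one third, which shows how even a modest timestamp shift is penalized.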

Code & Models

Repositories

gyxxyg/vtg-llm (PyTorch, official)

Models

Yongxin-Guo/VTG-LLM (model, ♡ 3)

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications