VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, Kevin Zhao

TL;DR
This paper introduces VTG-LLM, a novel model that enhances video large language models' ability to accurately localize event timestamps in videos by integrating timestamp knowledge and employing efficient token compression, supported by a new dataset.
Contribution
We propose VTG-LLM, a model that incorporates timestamp knowledge into video LLMs, and introduce a new dataset, VTG-IT-120K, for improved video temporal grounding.
Findings
VTG-LLM outperforms existing methods in VTG tasks.
Effective timestamp knowledge integration improves localization accuracy.
The new VTG-IT-120K dataset provides higher-quality annotations for VTG research.
Abstract
Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries, significantly impacting downstream tasks like video browsing and editing. Unlike traditional task-specific models, Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. Consequently, exploring the application of video LLMs for VTG tasks has become a burgeoning research area. However, despite considerable advancements in video content understanding, video LLMs often struggle to accurately pinpoint timestamps within videos, limiting their effectiveness in VTG tasks. To address this, we introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities. Our approach includes: (1) effectively integrating timestamp knowledge into visual tokens; (2) incorporating absolute-time tokens to manage…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
