VTimeLLM: Empower LLM to Grasp Video Moments
Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

TL;DR
VTimeLLM is a novel Video LLM that achieves fine-grained temporal understanding of videos, accurately identifying event boundaries and improving performance in tasks like temporal grounding and dense captioning.
Contribution
It introduces a boundary-aware three-stage training strategy for Video LLMs, enhancing temporal boundary detection and reasoning capabilities.
Findings
Significantly outperforms existing Video LLMs on temporal grounding and dense captioning.
Demonstrates superior performance on video dialogue tasks.
Enhances cross-modal understanding and reasoning abilities.
Abstract
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we solve this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, our VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate…
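The three-stage strategy described in the abstract can be summarized as a small data structure. The following is only an illustrative sketch: the `TrainingStage` class, the `STAGES` tuple, and the `describe_schedule` helper are hypothetical names introduced here, not part of the authors' code, and the stage contents are taken solely from the abstract's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStage:
    """One stage of the schedule (hypothetical container, not the authors' API)."""
    data: str  # kind of training data the abstract names for this stage
    goal: str  # what the abstract says the stage is meant to achieve


# Stage order and descriptions follow the abstract; nothing else is assumed.
STAGES = (
    TrainingStage("image-text pairs", "feature alignment"),
    TrainingStage("multiple-event videos", "temporal-boundary awareness"),
    TrainingStage(
        "high-quality video-instruction tuning data",
        "temporal understanding and alignment with human intents",
    ),
)


def describe_schedule(stages=STAGES):
    """Return one human-readable line per stage, numbered from 1."""
    return [f"{i}. {s.data} -> {s.goal}" for i, s in enumerate(stages, start=1)]


if __name__ == "__main__":
    for line in describe_schedule():
        print(line)
```

Keeping the stages in an ordered, immutable tuple mirrors the point that each stage builds on the previous one: alignment first, boundary awareness second, instruction tuning last.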
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: ALIGN
