VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang; Xin Wang; Hong Chen; Zihan Song; Wenwu Zhu

arXiv:2311.18445 · cs.CV · December 1, 2023 · 1 cites

TL;DR

VTimeLLM is a novel Video LLM that achieves fine-grained temporal understanding of videos, accurately identifying event boundaries and improving performance in tasks like temporal grounding and dense captioning.

Contribution

It introduces a boundary-aware three-stage training strategy for Video LLMs, enhancing temporal boundary detection and reasoning capabilities.

Findings

01. Significantly outperforms existing Video LLMs in temporal grounding and captioning.

02. Demonstrates superior performance in video dialogue tasks.

03. Enhances cross-modal understanding and reasoning abilities.
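
Temporal grounding is usually judged by how closely a predicted (start, end) segment overlaps the annotated one. This page does not say which metrics the paper reports, so the following is only a minimal sketch of the commonly used temporal IoU and a recall-at-threshold score. The function names `temporal_iou` and `recall_at_iou` are hypothetical and are not taken from the VTimeLLM repository.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments on a 1-D timeline."""
    p_start, p_end = pred
    g_start, g_end = gt
    if p_end < p_start or g_end < g_start:
        raise ValueError("each segment must satisfy start <= end")
    # Overlap is zero when the segments are disjoint.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment],
                  threshold: float) -> float:
    """Fraction of paired predictions whose IoU reaches the threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and equally long")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so its IoU is 1/3 and it would not count as a hit at a 0.5 threshold.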

Abstract

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which uses image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding as well as align with human intents. Extensive experiments demonstrate…

Code & Models

Repositories

huangb23/vtimellm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN