Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang

TL;DR
Momentor is a novel Video-LLM that achieves fine-grained temporal reasoning and localization in videos, overcoming previous models' limitations in segment-level understanding by training on a large-scale, automatically generated dataset.
Contribution
The paper introduces Momentor, a Video-LLM capable of fine-grained temporal reasoning, and Moment-10M, a large-scale video instruction dataset with segment-level instruction data, built by an automatic data generation engine to support Momentor's training.
Findings
Momentor outperforms existing models in fine-grained temporal understanding.
It demonstrates strong zero-shot performance on localization tasks.
Training on Moment-10M enables segment-level reasoning and localization.
Abstract
Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these attributes to video modality, which are termed Video-LLMs. However, existing Video-LLMs can only capture the coarse-grained semantics and are unable to effectively handle tasks related to comprehension or localization of specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
