Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Long Qian; Juncheng Li; Yu Wu; Yaobo Ye; Hao Fei; Tat-Seng Chua; Yueting Zhuang; Siliang Tang

arXiv:2402.11435·cs.CV·June 4, 2024·2 cites

TL;DR

Momentor is a novel Video-LLM that achieves fine-grained temporal reasoning and localization in videos, overcoming previous models' limitations in segment-level understanding by training on a large-scale, automatically generated dataset.

Contribution

The paper introduces Momentor, a Video-LLM capable of fine-grained temporal reasoning, supported by a new large-scale dataset, Moment-10M, for training and evaluation.

Findings

01 Momentor outperforms existing models in fine-grained temporal understanding.

02 It demonstrates strong zero-shot performance on localization tasks.

03 The dataset enables effective training of segment-level reasoning models.

Abstract

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these capabilities to the video modality, producing models termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require comprehending or localizing specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in…
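To make "localization of specific video segments" concrete, the sketch below scores a predicted segment against a reference segment using temporal intersection-over-union, a common way to compare time intervals. This page does not state which metric the paper uses, so treat the metric choice, the function name `temporal_iou`, and the `(start, end)` seconds representation as illustrative assumptions rather than the authors' evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds.

    Hypothetical helper for illustration; not taken from the Momentor codebase.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Guard against degenerate zero-length intervals.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A prediction covering 0-10 s against a reference covering 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A prediction is typically counted as correct when this score exceeds a chosen threshold; the threshold itself is a design choice left open here.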

Code & Models

Repositories

dcdmllm/momentor
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling