From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

Heqing Zou; Tianze Luo; Guiyang Xie; Victor (Xiao Jie) Zhang; Fengmao Lv; Guangcong Wang; Junyang Chen; Zhuochen Wang; Hansheng Zhang; Huaijian Zhang

arXiv:2409.18938 · cs.CV · December 4, 2024

TL;DR

This paper reviews the progress of MultiModal Large Language Models in understanding long videos, highlighting unique challenges and advancements in model design and training for long-term temporal dependencies.

Contribution

It provides a comprehensive survey of MM-LLMs from image to long video understanding, emphasizing differences, challenges, and future directions.

Findings

01. Long videos require modeling long-term dependencies.

02. Advancements in model design improve long video understanding.

03. Performance varies across video lengths and benchmarks.

Abstract

The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model designing and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from…

Code & Models

Repositories

Vincent-ZHQ/LV-LLMs (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications