From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou, Tianze Luo, Guiyang Xie, Victor (Xiao Jie) Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang

TL;DR
This paper reviews the progress of MultiModal Large Language Models in understanding long videos, highlighting unique challenges and advancements in model design and training for long-term temporal dependencies.
Contribution
It provides a comprehensive survey of MM-LLMs from image to long video understanding, emphasizing differences, challenges, and future directions.
Findings
Long videos contain multiple events, so models must capture between-event and long-term temporal dependencies, not only within-event information.
Advancements in model design improve long video understanding.
Performance varies across video lengths and benchmarks.
Abstract
The integration of Large Language Models (LLMs) with visual encoders has recently shown promising performance in visual understanding tasks, leveraging their inherent capability to comprehend and generate human-like text for visual reasoning. Given the diverse nature of visual data, MultiModal Large Language Models (MM-LLMs) exhibit variations in model designing and training for understanding images, short videos, and long videos. Our paper focuses on the substantial differences and unique challenges posed by long video understanding compared to static image and short video understanding. Unlike static images, short videos encompass sequential frames with both spatial and within-event temporal information, while long videos consist of multiple events with between-event and long-term temporal information. In this survey, we aim to trace and summarize the advancements of MM-LLMs from…
Taxonomy
Topics: Multimodal Machine Learning Applications
