MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding
Zizhong Li, Haopeng Zhang, Jiawei Zhang

TL;DR
This paper introduces MMViR, a multi-modal, multi-grained structured representation for long video understanding that improves performance and efficiency in tasks like QA, summarization, and retrieval.
Contribution
MMViR is a novel structured representation that segments long videos at key turning points and builds a three-level description coupling global narratives with fine-grained visual details.
Findings
Achieves a 19.67% improvement in long video understanding accuracy.
Reduces processing latency to 45.4% of that of previous methods.
Outperforms prior methods across QA, summarization, and retrieval tasks.
Abstract
Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest…
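The abstract describes segmenting a video at turning points, attaching descriptions at several granularities, and retrieving content by query. The following is a minimal sketch of that idea, not the authors' implementation. Every name here (`Segment`, `VideoRepresentation`, `find_turning_points`, `segment_video`, `retrieve`) is hypothetical, the three levels are assumed to be a global narrative, a per-segment summary, and per-segment visual details, and turning points are assumed to come from thresholding a precomputed frame-to-frame change score. Retrieval is plain keyword overlap, standing in for whatever matching the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Segment:
    start: int          # index of the first frame (inclusive)
    end: int            # index past the last frame (exclusive)
    summary: str = ""   # assumed middle level: segment description
    details: List[str] = field(default_factory=list)  # assumed fine level


@dataclass
class VideoRepresentation:
    global_narrative: str     # assumed top level: whole-video narrative
    segments: List[Segment]


def find_turning_points(scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices i > 0 whose change score exceeds the threshold.

    scores[i] is assumed to measure how much frame i differs from frame i - 1.
    """
    return [i for i, s in enumerate(scores) if i > 0 and s > threshold]


def segment_video(num_frames: int, turning_points: Sequence[int]) -> List[Segment]:
    """Split the range [0, num_frames) at the given turning points."""
    bounds = [0] + sorted(set(turning_points)) + [num_frames]
    # Skip empty spans that arise if a turning point sits on a boundary.
    return [Segment(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def retrieve(rep: VideoRepresentation, query: str, top_k: int = 1) -> List[Segment]:
    """Rank segments by how many query words appear in their text."""
    terms = set(query.lower().split())

    def score(seg: Segment) -> int:
        words = " ".join([seg.summary, *seg.details]).lower().split()
        return sum(1 for t in terms if t in words)

    return sorted(rep.segments, key=score, reverse=True)[:top_k]
```

In this sketch a query touches only the short text attached to each segment, which illustrates why a structured representation can be cheaper to search than the raw frames.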
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
