MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding

Zizhong Li; Haopeng Zhang; Jiawei Zhang

arXiv:2601.05495 · cs.CV · January 12, 2026

TL;DR

This paper introduces MMViR, a multi-modal, multi-grained structured representation for long video understanding that improves performance and efficiency in tasks like QA, summarization, and retrieval.

Contribution

MMViR is a novel structured representation that segments long videos at key turning points and builds a three-level description coupling global narratives with fine-grained visual details.

Findings

01. Achieves a 19.67% improvement in long video understanding accuracy.

02. Reduces processing latency to 45.4% of that of previous methods.

03. Outperforms prior methods across QA, summarization, and retrieval tasks.

Abstract

Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest…
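
The abstract describes three steps: segmenting at turning points, attaching multi-level descriptions, and retrieving by query. The sketch below is one way such a structure could be laid out, not the authors' implementation. Everything in it is an assumption made for illustration: the names `Segment`, `VideoRecord`, `find_turning_points`, `segment_video`, and `retrieve` are hypothetical, the three levels are guessed as a global summary, a per-segment narrative, and per-segment visual details, turning points are approximated by a drop in cosine similarity between consecutive frame features, and retrieval uses plain word overlap.

```python
from dataclasses import dataclass, field
from typing import List
import math


@dataclass
class Segment:
    start: int                      # first frame index (inclusive)
    end: int                        # last frame index (exclusive)
    narrative: str = ""             # assumed level: segment-level description
    details: List[str] = field(default_factory=list)  # assumed level: visual details


@dataclass
class VideoRecord:
    global_summary: str             # assumed level: whole-video narrative
    segments: List[Segment]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_turning_points(features, threshold=0.5):
    """Stand-in heuristic (an assumption, not the paper's detector):
    frame i is a turning point if its similarity to frame i-1 is below threshold."""
    return [i for i in range(1, len(features))
            if cosine(features[i - 1], features[i]) < threshold]


def segment_video(features, threshold=0.5):
    """Split the frame range into contiguous segments at the turning points."""
    if not features:
        return []
    cuts = [0] + find_turning_points(features, threshold) + [len(features)]
    return [Segment(start=s, end=e) for s, e in zip(cuts, cuts[1:])]


def retrieve(record, query, top_k=1):
    """Rank segments by how many query words appear in their text (toy scoring)."""
    q = set(query.lower().split())

    def score(seg):
        words = " ".join([seg.narrative, *seg.details]).lower().split()
        return len(q & set(words))

    return sorted(record.segments, key=score, reverse=True)[:top_k]
```

In this layout a query only has to be compared against short segment texts rather than the full video, which is the kind of saving that query-based retrieval over a structured representation is meant to give. A real system would replace the word-overlap score with embedding similarity and generate the descriptions with a captioning model.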

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition