HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Mengqi Shi; Haopeng Zhang

arXiv:2605.19223·cs.CV·May 20, 2026

TL;DR

HAVEN introduces a hierarchical, multimodal benchmark dataset for comprehensive evaluation of video understanding, addressing gaps in how existing benchmarks evaluate summarization and reasoning.

Contribution

HAVEN presents a dataset annotated at the frame, shot, and video levels across both video and text, with explicit alignments between the two modalities, together with a suite of evaluation tasks for assessing hierarchical video understanding.

Findings

01 State-of-the-art models show gaps in grounded multimodal understanding.

02 HAVEN's benchmark exposes limitations of current models in hierarchical reasoning.

03 The dataset and protocols are publicly released for future research.

Abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding,…
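
To make the idea of frame-, shot-, and video-level annotation with explicit alignment more concrete, here is a minimal sketch of how such a hierarchy could be represented. This is not HAVEN's actual schema, which the abstract does not specify: the class names (`FrameAnnotation`, `ShotAnnotation`, `VideoAnnotation`), their fields, and the sample captions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical schema: illustrative only, not the format released with HAVEN.
@dataclass
class FrameAnnotation:
    index: int    # frame index within the video
    caption: str  # frame-level text aligned to this frame


@dataclass
class ShotAnnotation:
    start_frame: int  # first frame of the shot (inclusive)
    end_frame: int    # last frame of the shot (inclusive)
    summary: str      # shot-level text aligned to this frame range
    frames: List[FrameAnnotation] = field(default_factory=list)

    def contains(self, frame_index: int) -> bool:
        return self.start_frame <= frame_index <= self.end_frame


@dataclass
class VideoAnnotation:
    video_id: str
    summary: str  # video-level text summary
    shots: List[ShotAnnotation] = field(default_factory=list)

    def shot_for_frame(self, frame_index: int) -> Optional[ShotAnnotation]:
        """Return the shot whose frame range covers frame_index, if any."""
        for shot in self.shots:
            if shot.contains(frame_index):
                return shot
        return None

    def is_consistent(self) -> bool:
        """Check that shots do not overlap and that every annotated
        frame lies inside its parent shot."""
        ordered = sorted(self.shots, key=lambda s: s.start_frame)
        for prev, nxt in zip(ordered, ordered[1:]):
            if nxt.start_frame <= prev.end_frame:
                return False
        return all(
            shot.contains(frame.index)
            for shot in self.shots
            for frame in shot.frames
        )


# Made-up example data, not drawn from the benchmark.
example = VideoAnnotation(
    video_id="demo",
    summary="A person cooks pasta and serves it.",
    shots=[
        ShotAnnotation(0, 49, "Boiling water", [FrameAnnotation(10, "Pot on stove")]),
        ShotAnnotation(50, 99, "Serving", [FrameAnnotation(75, "Plate on table")]),
    ],
)
```

Nesting frames inside shots and shots inside a video keeps each text annotation tied to an explicit temporal span, so a frame index can be traced to its shot with `example.shot_for_frame(75)` and the hierarchy can be sanity-checked with `example.is_consistent()`.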
