LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

ZhaoYang Han; Qihan Lin; Hao Liang; Bowen Chen; Zhou Liu; Wentao Zhang

arXiv:2510.17305 · cs.CV · October 22, 2025

TL;DR

LongInsightBench is a new comprehensive benchmark for evaluating omni-modal models' ability to understand long, human-centric videos by integrating visual, audio, and text modalities across diverse, challenging scenarios.

Contribution

It introduces the first benchmark focused on long-video understanding with multi-modal data, diverse tasks, and rigorous quality assurance pipelines.

Findings

01

Omni-modal models struggle with precise temporal localization.

02

Long-range causal inference remains challenging for current models.

03

Multi-modal fusion can lead to information loss and processing bias.

Abstract

We introduce LongInsightBench, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on a duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning