LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

TL;DR
LVOmniBench introduces a comprehensive benchmark for evaluating long-form audio-visual understanding in omnimodal large language models, highlighting current models' limitations on extended videos and fostering future advancements.
Contribution
The paper presents LVOmniBench, a new dataset and evaluation framework for long-duration audio-visual comprehension in OmniLLMs, addressing a significant gap in existing short-clip benchmarks.
Findings
Most current OmniLLMs struggle with long videos, achieving below 35% accuracy.
Gemini 3 Pro attains around 65% accuracy on the benchmark.
Long-form audio-visual understanding remains a challenging research area.
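For readers unfamiliar with how accuracy figures like these are typically produced, here is a minimal sketch of exact-match scoring over question-answer pairs. The `QAResult` type, the video identifiers, the sample answers, and the exact-match rule are all illustrative assumptions; this page does not specify the benchmark's answer format or its scoring protocol.

```python
from dataclasses import dataclass


@dataclass
class QAResult:
    """One scored question (hypothetical structure, not from the paper)."""
    video_id: str   # hypothetical identifier for the source video
    predicted: str  # answer produced by the model
    reference: str  # ground-truth answer from the annotation


def accuracy(results):
    """Fraction of results whose prediction matches the reference.

    Matching is case-insensitive and ignores surrounding whitespace.
    This is an assumed scoring rule, not the benchmark's documented one.
    """
    if not results:
        raise ValueError("no results to score")
    correct = sum(
        r.predicted.strip().lower() == r.reference.strip().lower()
        for r in results
    )
    return correct / len(results)


# Toy data, made up for illustration only.
results = [
    QAResult("vid_001", "B", "B"),
    QAResult("vid_001", "a", "C"),
    QAResult("vid_002", "D", "D"),
    QAResult("vid_003", "A", "B"),
]

print(f"accuracy = {accuracy(results):.2%}")
```

On the toy data above, two of the four predictions match, so the score is 50%. A real evaluation would run over all QA pairs in the benchmark and might also report per-domain breakdowns.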
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
