QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models

Zixing Lin; Jiale Wang; Gee Wah Ng; Lee Onn Mak; Chan Zhi Yang Jeriel; Jun Yang Lee; Yaohao Li

arXiv:2601.06573 · cs.AI · January 13, 2026

TL;DR

QMAVIS is a long video-audio understanding pipeline that combines large multimodal models, large language models, and speech recognition models to analyze videos ranging from a few minutes to over an hour. It reports a 38.75% improvement over state-of-the-art video-audio LMMs on VideoMME (with subtitles).

Contribution

Introduces QMAVIS, a late-fusion pipeline for long video-audio understanding that extends analysis beyond the short clips on which large multimodal models have traditionally been evaluated.
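
The late-fusion idea can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-length windowing, the text layout of the fused notes, and every name here (`Segment`, `split_into_segments`, `late_fusion_answer`, and the three callables standing in for an LMM, a speech recognition model, and an LLM) are assumptions made for illustration only. The sketch shows only the general pattern in which each model runs separately and their text outputs are combined afterwards.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A time window of the video, in seconds (hypothetical helper type)."""
    start_s: float
    end_s: float


def split_into_segments(duration_s: float, window_s: float) -> List[Segment]:
    """Cut [0, duration_s) into consecutive windows; the last may be shorter."""
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        segments.append(Segment(start, end))
        start = end
    return segments


def late_fusion_answer(
    duration_s: float,
    window_s: float,
    question: str,
    describe_segment: Callable[[Segment], str],    # stand-in for a video LMM
    transcribe_segment: Callable[[Segment], str],  # stand-in for a speech recognizer
    answer_from_text: Callable[[str], str],        # stand-in for a text-only LLM
) -> str:
    """Run each model independently per segment, then fuse their text outputs."""
    notes = []
    for seg in split_into_segments(duration_s, window_s):
        visual = describe_segment(seg)
        speech = transcribe_segment(seg)
        notes.append(
            f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] visual: {visual} | speech: {speech}"
        )
    # Fusion happens late: only text reaches the final model.
    prompt = "\n".join(notes) + f"\nQuestion: {question}"
    return answer_from_text(prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without any real LMM, ASR, or LLM.
    reply = late_fusion_answer(
        duration_s=150,
        window_s=60,
        question="What happens in the video?",
        describe_segment=lambda seg: "a person speaking at a desk",
        transcribe_segment=lambda seg: "(no speech detected)",
        answer_from_text=lambda prompt: prompt,  # echo the fused prompt
    )
    print(reply)
```

Passing the models in as callables keeps the sketch independent of any particular model or library; how QMAVIS actually segments videos and prompts its models is not stated in this summary.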

Findings

01. 38.75% improvement over state-of-the-art video-audio LMMs on the VideoMME (with subtitles) dataset

02. Up to 2% improvement on the PerceptionTest and EgoSchema datasets

03. Effective extraction of both scene-level nuances and overarching narratives in long videos

Abstract

Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on short videos a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos ranging from a few minutes to more than an hour in length, opening up potential applications in sensemaking, video content analysis, embodied AI, and related areas. Quantitative experiments with QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLLaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets like…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization