QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models
Zixing Lin, Jiale Wang, Gee Wah Ng, Lee Onn Mak, Chan Zhi Yang Jeriel, Jun Yang Lee, Yaohao Li

TL;DR
QMAVIS is a long video-audio understanding pipeline that combines large multimodal models, large language models, and speech recognition models to analyze videos ranging from a few minutes to over an hour long, reporting a 38.75% improvement over state-of-the-art video-audio LMMs on VideoMME (with subtitles).
Contribution
Introduces QMAVIS, a late-fusion pipeline for long video-audio understanding that extends analysis from short clips of a few minutes to videos over an hour long.
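
A minimal sketch of what late fusion can look like in this setting, for illustration only. The summary above does not describe how QMAVIS segments a video or builds its prompts, so the fixed-length segmentation, the prompt layout, and every name here (`Segment`, `split_into_segments`, `late_fusion_answer`, and the three callables standing in for a video LMM, a speech recognizer, and an LLM) are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A time window of the video, in seconds (hypothetical helper type)."""
    start_s: float
    end_s: float


def split_into_segments(duration_s: float, segment_s: float) -> List[Segment]:
    """Cut [0, duration_s] into consecutive windows of at most segment_s seconds."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    segments: List[Segment] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append(Segment(start, end))
        start = end
    return segments


def late_fusion_answer(
    duration_s: float,
    question: str,
    describe_segment: Callable[[Segment], str],    # stand-in for a video LMM
    transcribe_segment: Callable[[Segment], str],  # stand-in for a speech recognizer
    answer_with_llm: Callable[[str], str],         # stand-in for a text-only LLM
    segment_s: float = 60.0,                       # assumed window length
) -> str:
    """Run each model independently per segment, then fuse their text outputs."""
    notes = []
    for seg in split_into_segments(duration_s, segment_s):
        visual = describe_segment(seg)
        speech = transcribe_segment(seg)
        notes.append(
            f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] VISUAL: {visual} | SPEECH: {speech}"
        )
    # Fusion happens late: only the per-model text outputs are combined.
    prompt = "\n".join(notes) + f"\nQUESTION: {question}"
    return answer_with_llm(prompt)
```

With real models plugged in for the three callables, the LLM sees a timestamped log of visual descriptions and speech transcripts rather than raw frames or audio, which is what makes the fusion "late": each modality is processed separately and only the text outputs are merged.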
Findings
38.75% improvement over state-of-the-art video-audio LMMs on the VideoMME (with subtitles) dataset
Up to 2% improvement on PerceptionTest and EgoSchema datasets
Effective extraction of scene nuances and overarching narratives in long videos
Abstract
Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on short videos a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos ranging from a few minutes to over an hour long, opening up potential applications in sensemaking, video content analysis, embodied AI, and related areas. Quantitative experiments with QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLlaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets like…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
