See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

TL;DR
This paper introduces AV-SpeakerBench, a benchmark for evaluating multimodal large language models' ability to understand and reason about human speech in videos, emphasizing audiovisual alignment and speaker-centric reasoning.
Contribution
The paper presents a new benchmark with curated questions focusing on speaker-centric audiovisual reasoning, along with comprehensive evaluations of existing models' performance.
Findings
Gemini 2.5 Pro achieves the best results on AV-SpeakerBench.
Open-source models lag behind Gemini 2.5 Pro mainly due to weaker audiovisual fusion.
Qwen3-Omni-30B approaches Gemini 2.0 Flash but still underperforms.
Abstract
Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms…
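Because the benchmark is made up of multiple-choice questions, a model's score presumably comes down to the fraction of questions answered correctly. The following is a minimal sketch of that scoring step only. The `MCQuestion` record, its field names, and the `accuracy` helper are hypothetical and chosen for illustration; they are not the benchmark's actual data schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQuestion:
    """Hypothetical record for one multiple-choice item (field names are assumptions)."""
    video_id: str        # identifier of the source clip
    question: str        # question text, e.g. about who speaks, what is said, or when
    options: List[str]   # candidate answers
    answer_index: int    # index of the correct option in `options`


def accuracy(questions: Sequence[MCQuestion], predictions: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted option index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        1 for q, p in zip(questions, predictions) if p == q.answer_index
    )
    return correct / len(questions)
```

In use, `predictions` would hold the option index a model selected for each question, in the same order as `questions`.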
