See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen; Zhuoran Yu; Samuel Low Yu Hang; Subin An; Jeongik Lee; Yohan Ban; SeungEun Chung; Thanh-Huy Nguyen; JuWan Maeng; Soochahn Lee; Yong Jae Lee

arXiv:2512.02231·cs.CV·April 13, 2026

TL;DR

This paper introduces AV-SpeakerBench, a benchmark for evaluating multimodal large language models' ability to understand and reason about human speech in videos, emphasizing audiovisual alignment and speaker-centric reasoning.

Contribution

The paper presents a new benchmark with curated questions focusing on speaker-centric audiovisual reasoning, along with comprehensive evaluations of existing models' performance.

Findings

1. Gemini 2.5 Pro achieves the best results on AV-SpeakerBench.

2. Open-source models lag behind Gemini 2.5 Pro mainly due to weaker audiovisual fusion.

3. Qwen3-Omni-30B approaches Gemini 2.0 Flash but still underperforms.

Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms…
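Because the benchmark consists of multiple-choice questions, model performance can be summarized as plain accuracy over predicted option letters. The following is a minimal sketch of such scoring, not the authors' evaluation code. The `Item` record, its field names, the four-option `"ABCD"` default, and the response-parsing rule are all assumptions made for illustration; the actual data schema and answer-extraction procedure may differ.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """Hypothetical benchmark record; field names are assumptions."""
    question_id: str
    answer: str  # gold option letter, e.g. "B"


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone uppercase option letter, or None.

    Assumes short responses such as "B", "(B)", or "Answer: B". A response
    that begins with the article "A" would be misread as option A.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def accuracy(items: List[Item], responses: Dict[str, str]) -> float:
    """Fraction of items whose extracted choice equals the gold letter.

    Missing or unparseable responses count as incorrect.
    """
    if not items:
        return 0.0
    correct = sum(
        extract_choice(responses.get(item.question_id, "")) == item.answer
        for item in items
    )
    return correct / len(items)
```

Counting unparseable responses as incorrect is one possible design choice; it keeps the denominator fixed at the number of questions, so scores stay comparable across models that differ in how often they follow the answer format.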

Code & Models

Repositories

plnguyen2908/AV-SpeakerBench (GitHub)
