M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim

TL;DR
M3-SLU is a benchmark of over 12,000 validated instances for evaluating how well multimodal large language models understand multi-speaker, multi-turn conversations. Baseline results show that speaker attribution remains difficult even for models with strong speech and text comprehension.
Contribution
The paper presents M3-SLU, a new benchmark dataset and evaluation framework specifically designed to assess speaker-attributed reasoning in multimodal large language models.
Findings
Models can generally capture what was said.
Models often fail to identify who said it.
Speaker attribution remains a key challenge in multimodal dialogue understanding.
Abstract
We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in…
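The abstract mentions accuracy as one of the evaluation metrics. As a rough illustration only, the sketch below shows how accuracy could be computed for a speaker-attribution task by comparing predicted speaker labels against gold labels. The `MatchingInstance` record, its field names, the label normalization, and the sample data are all assumptions made for this example; they are not taken from the M3-SLU release or its evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MatchingInstance:
    """Hypothetical record for one speaker-attribution item (not the M3-SLU schema)."""
    instance_id: str
    gold_speaker: str       # reference speaker label
    predicted_speaker: str  # label returned by the model under test


def normalize(label: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return label.strip().lower()


def attribution_accuracy(instances: Sequence[MatchingInstance]) -> float:
    """Fraction of instances whose predicted speaker matches the gold speaker."""
    if not instances:
        return 0.0
    correct = sum(
        normalize(item.gold_speaker) == normalize(item.predicted_speaker)
        for item in instances
    )
    return correct / len(instances)


# Made-up sample data, for illustration only.
examples = [
    MatchingInstance("ex1", "Speaker A", "speaker a"),
    MatchingInstance("ex2", "Speaker B", "Speaker A"),
    MatchingInstance("ex3", "Speaker B", "Speaker B "),
    MatchingInstance("ex4", "Speaker A", "Speaker A"),
]

accuracy = attribution_accuracy(examples)
print(f"accuracy = {accuracy:.2f}")
```

Exact-match accuracy of this kind fits tasks with a closed set of speaker labels. Open-ended question answering would need a different scorer, such as the LLM-as-Judge approach the abstract mentions.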
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis
