M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models

Yejin Kwon; Taewoo Kang; Hyunsoo Yoon; Changouk Kim

arXiv:2510.19358 · cs.CL · October 23, 2025

Open Access

TL;DR

M3-SLU introduces a comprehensive benchmark for evaluating multimodal large language models' ability to understand multi-speaker conversations, highlighting significant challenges in speaker attribution despite advances in speech and text comprehension.

Contribution

The paper presents M3-SLU, a new benchmark dataset and evaluation framework specifically designed to assess speaker-attributed reasoning in multimodal large language models.

Findings

1. Models excel at understanding what was said.

2. Models struggle with identifying who said it.

3. Speaker attribution remains a key challenge in multimodal dialogue understanding.

Abstract

We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in…
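As a rough illustration of the accuracy metric mentioned in the abstract, the sketch below scores a speaker-attribution task by exact match between predicted and gold speaker labels. The abstract does not specify the instance schema or the exact scoring rule, so the `MatchingInstance` fields, the `attribution_accuracy` helper, and the toy data are all hypothetical and are not taken from the M3-SLU release.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MatchingInstance:
    """Hypothetical record for an utterance-matching item.

    Field names are assumptions for illustration, not the benchmark's schema.
    """
    instance_id: str
    candidate_speakers: List[str]
    gold_speaker: str


def attribution_accuracy(
    instances: List[MatchingInstance], predictions: Dict[str, str]
) -> float:
    """Return the fraction of instances whose predicted speaker equals the gold label.

    A missing prediction is counted as incorrect.
    """
    if not instances:
        raise ValueError("no instances to score")
    correct = sum(
        1
        for inst in instances
        if predictions.get(inst.instance_id) == inst.gold_speaker
    )
    return correct / len(instances)


# Made-up toy data, used only to exercise the function.
toy_instances = [
    MatchingInstance("ex1", ["A", "B"], "A"),
    MatchingInstance("ex2", ["A", "B"], "B"),
    MatchingInstance("ex3", ["A", "B", "C"], "C"),
    MatchingInstance("ex4", ["A", "B"], "A"),
]
toy_predictions = {"ex1": "A", "ex2": "A", "ex3": "C"}  # ex4 has no prediction

score = attribution_accuracy(toy_instances, toy_predictions)
print(f"accuracy = {score:.2f}")
```

Exact-match accuracy of this kind only suits tasks with a closed set of speaker labels; free-form answers, as in the question-answering task, would need a different scorer such as the LLM-as-Judge the abstract mentions.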

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis