MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios

Shuai Wang; Zhaokai Sun; Zhennan Lin; Chengyou Wang; Zhou Pan; Lei Xie

arXiv:2508.08155 · eess.AS · August 12, 2025

Open Access

TL;DR

MSU-Bench is a new comprehensive benchmark designed to evaluate multi-speaker conversational understanding, revealing significant performance gaps in current models across increasingly complex tasks in realistic scenarios.

Contribution

Introduces MSU-Bench, a hierarchical, speaker-centric benchmark for multi-speaker conversational understanding, covering four progressive tiers from perception to reasoning.

Findings

1. Model performance declines as task complexity increases.

2. Open-source models lag behind commercial ones in multi-speaker reasoning.

3. MSU-Bench effectively assesses conversational understanding in realistic environments.

Abstract

Spoken Language Understanding (SLU) has progressed from traditional single-task methods to large audio language model (LALM) solutions. Yet, most existing speech benchmarks focus on single-speaker or isolated tasks, overlooking the challenges posed by multi-speaker conversations that are common in real-world scenarios. We introduce MSU-Bench, a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. Our hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers. By evaluating state-of-the-art models on MSU-Bench, we…
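
The four tiers named in the abstract form an ordered hierarchy, so one natural way to report results is accuracy per tier. The following is a minimal sketch of that idea, not the benchmark's actual data format or scoring code: the `Tier` enum, the `(tier, is_correct)` record layout, the `accuracy_by_tier` helper, and the toy records are all hypothetical and introduced here only for illustration.

```python
from collections import defaultdict
from enum import IntEnum


class Tier(IntEnum):
    """The four progressive tiers named in the abstract, ordered by complexity."""
    SINGLE_SPEAKER_STATIC = 1
    SINGLE_SPEAKER_DYNAMIC = 2
    MULTI_SPEAKER_BACKGROUND = 3
    MULTI_SPEAKER_INTERACTION = 4


def accuracy_by_tier(results):
    """Return {Tier: fraction correct} from an iterable of (tier, is_correct) pairs.

    The (tier, is_correct) record layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in results:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    # Sort so the output runs from the simplest tier to the most complex.
    return {tier: correct[tier] / totals[tier] for tier in sorted(totals)}


# Made-up toy records, for illustration only (not results from the paper).
toy_results = [
    (Tier.SINGLE_SPEAKER_STATIC, True),
    (Tier.SINGLE_SPEAKER_STATIC, True),
    (Tier.MULTI_SPEAKER_INTERACTION, True),
    (Tier.MULTI_SPEAKER_INTERACTION, False),
]
scores = accuracy_by_tier(toy_results)
```

Using an `IntEnum` keeps the tiers comparable, so a per-tier breakdown can be sorted from basic perception to multi-speaker reasoning.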

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing