SpeakerSleuth: Can Large Audio-Language Models Judge Speaker Consistency across Multi-turn Dialogues?

Jonggeun Lee; Junseong Pyo; Gyuhyeon Seo; Yohan Jo

arXiv:2601.04029·cs.CL·April 21, 2026

TL;DR

SpeakerSleuth evaluates large audio-language models' ability to judge speaker consistency in multi-turn dialogues, revealing their bias towards textual cues over acoustic signals and highlighting areas for improvement.

Contribution

This work introduces a benchmark and dataset for assessing LALMs' speaker consistency judgment, exposing their limitations and modality biases in multi-turn dialogue evaluation.

Findings

01 Models struggle to reliably detect acoustic inconsistencies.

02 Performance drops when textual context is provided, as models favor textual cues over acoustic ones.

03 Models perform better at comparing and ranking acoustic variants.

Abstract

Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn dialogues remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency across multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating twelve widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When…
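The two failure modes named in the abstract (overpredicting inconsistency versus being overly lenient) can be separated with two simple rates. The sketch below is illustrative only: the `Judgment` record, the `bias_rates` function, and the toy data are hypothetical, and nothing here is taken from the paper's metrics or the SpeakerSleuth repository.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One dialogue-level verdict (hypothetical record, not the benchmark's schema)."""
    gold_consistent: bool  # True if every turn really comes from the same speaker
    pred_consistent: bool  # True if the model judged the dialogue consistent


def bias_rates(judgments):
    """Return how often a judge flags consistent dialogues and accepts inconsistent ones."""
    consistent = [j for j in judgments if j.gold_consistent]
    inconsistent = [j for j in judgments if not j.gold_consistent]

    # Overprediction: consistent dialogues wrongly flagged as inconsistent.
    over = (
        sum(not j.pred_consistent for j in consistent) / len(consistent)
        if consistent else 0.0
    )
    # Leniency: inconsistent dialogues wrongly accepted as consistent.
    lenient = (
        sum(j.pred_consistent for j in inconsistent) / len(inconsistent)
        if inconsistent else 0.0
    )
    return {"overprediction_rate": over, "leniency_rate": lenient}


# Toy data, invented for illustration: four consistent and two inconsistent dialogues.
toy = [
    Judgment(True, True),
    Judgment(True, True),
    Judgment(True, True),
    Judgment(True, False),   # consistent dialogue flagged as inconsistent
    Judgment(False, False),
    Judgment(False, True),   # inconsistent dialogue accepted as consistent
]
rates = bias_rates(toy)
print(rates)
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that flags everything distinguishable from one that accepts everything.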

Code & Models

Repositories

holi-lab/SpeakerSleuth (GitHub)