TL;DR
SpeakerSleuth evaluates large audio-language models' ability to judge speaker consistency in multi-turn dialogues, revealing their bias towards textual cues over acoustic signals and highlighting areas for improvement.
Contribution
This work introduces a benchmark and dataset for assessing LALMs' speaker consistency judgment, exposing their limitations and modality biases in multi-turn dialogue evaluation.
Findings
Models struggle to reliably detect acoustic inconsistencies.
Performance drops when textual context is provided, indicating that models favor textual cues over acoustic ones.
Models perform better when asked to compare and rank acoustic variants.
Abstract
Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn dialogues remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency across multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating twelve widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When…
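The abstract describes two opposite failure modes: over-predicting inconsistency and being overly lenient. One way to tell them apart is to report two error rates instead of a single accuracy figure. The sketch below is illustrative only and is not the benchmark's own scoring code. The function name `judge_error_rates`, the boolean label convention, and the toy labels are all assumptions made for this example.

```python
from typing import Dict, Sequence


def judge_error_rates(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Split a judge's errors into over-prediction and leniency.

    Assumed convention (hypothetical, not from the paper):
    True means "the speaker is consistent across the dialogue".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Predictions on instances whose gold label is "consistent".
    on_consistent = [p for g, p in zip(gold, pred) if g]
    # Predictions on instances whose gold label is "inconsistent".
    on_inconsistent = [p for g, p in zip(gold, pred) if not g]

    # Over-prediction: consistent dialogues flagged as inconsistent.
    false_alarm = (
        sum(1 for p in on_consistent if not p) / len(on_consistent)
        if on_consistent
        else float("nan")
    )
    # Leniency: inconsistent dialogues accepted as consistent.
    miss = (
        sum(1 for p in on_inconsistent if p) / len(on_inconsistent)
        if on_inconsistent
        else float("nan")
    )
    return {"false_alarm_rate": false_alarm, "miss_rate": miss}


# Toy labels made up for illustration; not data from the paper.
toy_gold = [True, True, False, False]
toy_pred = [False, True, True, True]
toy_rates = judge_error_rates(toy_gold, toy_pred)
```

A judge with a high `false_alarm_rate` and a low `miss_rate` leans toward over-predicting inconsistency. The reverse pattern indicates leniency. A single accuracy number would hide that distinction.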
