Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Junkai Wu; Xulin Fan; Bo-Ru Lu; Xilin Jiang; Nima Mesgarani; Mark Hasegawa-Johnson; Mari Ostendorf

arXiv:2409.04927·cs.CL·October 3, 2024

TL;DR

This paper evaluates speech large language models' ability to identify and understand speakers in spoken dialogue, revealing they mainly rely on conversation content rather than speaker identity, and proposes better evaluation tasks.

Contribution

The study highlights the limited speaker awareness of current SpeechLLMs and introduces a new dataset and evaluation framework focused on speaker identity recognition.

Findings

01

SpeechLLMs perform well on context-based questions, which can be answered without identifying the speakers.

02

Models show limited ability to recognize speaker identities from audio.

03

Proposed tasks better evaluate speaker recognition capabilities in SpeechLLMs.

Abstract

In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content and voice characteristics of speakers in a conversation. However, after carefully examining Gaokao's questions, we find the correct answers to many questions can be inferred from the conversation transcript alone, i.e., without speaker segmentation and identification. Our evaluation of state-of-the-art models Qwen-Audio and WavLLM on both Gaokao and our proposed "What Do You Like?" dataset shows a significantly higher accuracy in these context-based questions than in identity-critical…
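The comparison the abstract describes, accuracy on context-based questions versus identity-critical questions, can be illustrated with a minimal sketch. The record layout, the `accuracy_by_type` helper, and the sample rows are all hypothetical and chosen only for illustration; they are not the paper's data, results, or evaluation code.

```python
from collections import defaultdict

# Hypothetical records: (question_type, model_answer, correct_answer).
# These rows are made up for illustration and are NOT results from the paper.
records = [
    ("context-based", "A", "A"),
    ("context-based", "B", "B"),
    ("context-based", "C", "A"),
    ("identity-critical", "A", "B"),
    ("identity-critical", "C", "C"),
]


def accuracy_by_type(rows):
    """Return {question_type: fraction of rows answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question_type, predicted, gold in rows:
        total[question_type] += 1
        if predicted == gold:
            correct[question_type] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


scores = accuracy_by_type(records)
for qtype, acc in scores.items():
    print(f"{qtype}: {acc:.2f}")
```

Reporting the two question types separately, rather than one pooled accuracy, is what makes a gap between them visible.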

Code & Models

Repositories

wjk0925/slt2024-speechllm-speaker-understanding
none · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems