Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

Naohiro Tawara; Samuele Cornell; Alexander Polok; Marc Delcroix; Lukáš Burget; Shinji Watanabe

arXiv:2603.22709 · cs.CL · March 25, 2026

Open Access

TL;DR

This paper evaluates spoken language models for conversational ASR. It introduces metrics that assess semantic fidelity and overlap robustness in multi-speaker scenarios, and uses them to compare the strengths and weaknesses of LLM-based and modular approaches.

Contribution

The paper introduces tcpSemER, a semantic-aware extension of tcpWER, decomposes tcpWER into overlapping and non-overlapping components, and systematically compares LLM-based and modular ASR systems across challenging multi-speaker conditions.
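
To make the difference between an edit-distance metric and a semantic one concrete, here is a minimal sketch. It is not the paper's tcpSemER: it ignores speaker assignment and time constraints, and the function names (`word_error_rate`, `semantic_error`, `toy_embed`) are hypothetical. The placeholder `toy_embed` is a bag-of-words count vector, so it only measures lexical overlap. An actual semantic metric would presumably plug a sentence-embedding model into the `embed` parameter.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances against an empty reference
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion of a reference word
                cur[j - 1] + 1,            # insertion of a hypothesis word
                prev[j - 1] + (rw != hw),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1] / max(len(r), 1)


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (assumption): bag-of-words counts, not a semantic model."""
    return {w: float(c) for w, c in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_error(ref: str, hyp: str,
                   embed: Callable[[str], Vector] = toy_embed) -> float:
    """One minus the embedding cosine similarity, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - cosine(embed(ref), embed(hyp))))
```

Passing a different `embed` callable is the only change needed to swap the placeholder for a real embedding model, which keeps the scoring logic independent of any particular library.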

Findings

01. LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase.

02. Modular pipelines are more robust to increasing speaker count and overlap.

03. tcpSemER captures meaning-altering errors that conventional metrics miss.

Abstract

Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas…
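
The overlap decomposition mentioned in the abstract can be illustrated with a short sketch. The abstract does not say how errors are attributed to overlapping versus non-overlapping speech, so this example assumes segment-level attribution: a reference segment counts as overlapping if it intersects in time with any segment from a different speaker. The `Segment` type and its `n_errors` field (error counts supplied by some upstream scorer) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    speaker: str
    start: float   # seconds
    end: float     # seconds
    n_words: int   # number of reference words in the segment
    n_errors: int  # errors attributed to this segment by an upstream scorer (assumption)


def overlaps_other_speaker(seg: Segment, segments: List[Segment]) -> bool:
    """True if seg intersects in time with a segment from a different speaker."""
    return any(
        other.speaker != seg.speaker
        and other.start < seg.end
        and seg.start < other.end
        for other in segments
    )


def split_error_rate(segments: List[Segment]) -> Dict[str, float]:
    """Error rate computed separately over overlapping and non-overlapping segments."""
    totals = {"overlap": [0, 0], "non_overlap": [0, 0]}  # [errors, words]
    for seg in segments:
        key = "overlap" if overlaps_other_speaker(seg, segments) else "non_overlap"
        totals[key][0] += seg.n_errors
        totals[key][1] += seg.n_words
    return {k: (e / w if w else 0.0) for k, (e, w) in totals.items()}
```

For example, with speaker A talking from 0 to 2 s (10 words, 1 error), speaker B from 1 to 3 s (5 words, 2 errors), and speaker A again from 4 to 5 s (4 words, 0 errors), the first two segments fall in the overlap bucket and the third in the non-overlap bucket.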

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition