DNCASR: End-to-End Training for Speaker-Attributed ASR

Xianrui Zheng; Chao Zhang; Philip C. Woodland

arXiv:2506.01916 · eess.AS · June 3, 2025 · ACL

TL;DR

DNCASR is an end-to-end system that jointly performs speaker clustering and speech recognition for multi-party meetings, improving accuracy in overlapping speech scenarios through linked decoders and serialised training.

Contribution

The paper introduces an end-to-end trainable architecture in which linked decoders perform speaker attribution and ASR jointly, and which uses serialised training to handle overlapping speech.

Findings

01 Outperforms a non-linked parallel system on the AMI-MDM corpus

02 Achieves a 9.0% relative reduction in speaker-attributed word error rate (WER)

03 Effectively handles overlapping speech in meetings
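
The 9.0% figure is a relative reduction, meaning it is measured against the baseline's error rate rather than in absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the two WER values are hypothetical, chosen only for illustration, and are not results from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction of new_wer with respect to baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values for illustration only; NOT taken from the paper.
baseline = 40.0   # baseline speaker-attributed WER, in percent
improved = 36.4   # improved speaker-attributed WER, in percent

# An absolute drop of 3.6 points on a 40.0 baseline is a 9.0% relative reduction.
print(round(relative_reduction(baseline, improved), 1))
```

The same absolute drop would be a smaller relative reduction on a higher baseline, which is why the baseline system matters when comparing such figures.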

Abstract

This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained under a unified loss function. By employing a serialised training approach, DNCASR effectively addresses overlapping speech in real-world meetings, where the link improves the prediction of speaker indices in overlapping segments. Experiments on the AMI-MDM meeting corpus demonstrate that the jointly trained DNCASR outperforms a parallel system that does not have links between the speaker and ASR…
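
The abstract does not spell out the serialised training format. As a rough sketch, assuming it resembles serialised output training, where the transcripts of overlapping utterances are concatenated in order of start time with a speaker-change token between them, the paired word and speaker-index targets could be built as below. The `Utterance` type, the `<sc>` token, and the choice to label that token with the incoming speaker are all assumptions made for illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

SC = "<sc>"  # hypothetical speaker-change token


@dataclass
class Utterance:
    start: float      # start time in seconds
    speaker: int      # speaker index within the meeting
    words: List[str]  # reference words for this utterance


def serialise(utterances: List[Utterance]) -> Tuple[List[str], List[int]]:
    """Order utterances by start time and join them with SC.

    Returns a token sequence and a speaker-index sequence of equal length,
    i.e. one speaker label per output token.
    """
    tokens: List[str] = []
    speaker_ids: List[int] = []
    for i, utt in enumerate(sorted(utterances, key=lambda u: u.start)):
        if i > 0:
            tokens.append(SC)
            # Assumption: the change token carries the incoming speaker's index.
            speaker_ids.append(utt.speaker)
        tokens.extend(utt.words)
        speaker_ids.extend([utt.speaker] * len(utt.words))
    return tokens, speaker_ids


# Two overlapping utterances, deliberately given out of time order.
segment = [
    Utterance(start=1.2, speaker=1, words=["sure", "go", "ahead"]),
    Utterance(start=0.0, speaker=0, words=["shall", "we", "start"]),
]
tokens, speaker_ids = serialise(segment)
print(tokens)
print(speaker_ids)
```

Keeping the two sequences aligned token by token is one way to picture how a speaker decoder's output can be tied to an ASR decoder's output; the actual linking mechanism in DNCASR is not described in the text above.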

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques