Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

Jiawen Kang; Lingwei Meng; Mingyu Cui; Yuejiao Wang; Xixin Wu; Xunying Liu; Helen Meng

arXiv:2409.12388·eess.AS·January 6, 2025

TL;DR

This paper introduces Speaker-Aware CTC (SACTC), a training objective that improves multi-talker speech recognition by explicitly modeling speaker disentanglement, yielding relative word error rate (WER) reductions of 10% overall and 15% on low-overlap speech.

Contribution

It proposes SACTC, a new CTC variant tailored to multi-talker speech recognition that strengthens speaker disentanglement and, when combined with Serialized Output Training (SOT), outperforms standard SOT-CTC.

Findings

01 SACTC guides the encoder to represent different speakers in distinct temporal regions.

02 Relative WER reductions of 10% overall and 15% on low-overlap speech.

03 SOT-SACTC outperforms standard SOT-CTC across various overlap conditions.
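
For readers unfamiliar with the term, a relative WER reduction is the drop in WER divided by the baseline WER. The sketch below shows only that arithmetic. The function name and the input values are hypothetical illustrations, not figures reported in the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values chosen only to illustrate the formula (NOT paper results):
# a baseline at 10.0% WER and a new system at 9.0% WER.
reduction = relative_wer_reduction(10.0, 9.0)
print(f"relative reduction: {reduction:.0%}")
```

Note that a relative reduction of 10% is not a drop of 10 absolute WER points. In the hypothetical case above the absolute drop is one point.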

Abstract

Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a CTC variant tailored for multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees…
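
To make the SOT-CTC baseline mentioned in the abstract more concrete, here is a minimal pure-Python sketch under stated assumptions. It is not the authors' implementation and it does not implement SACTC: the Bayes risk weighting that SACTC adds is left out because this page does not specify it. The `<sc>` speaker-change token and the ordering of speakers by start time follow common SOT practice and are assumptions here. The function names `serialize_sot` and `ctc_nll` are hypothetical.

```python
import math

SC = "<sc>"  # speaker-change token; the name is an assumption for illustration


def serialize_sot(utterances):
    """Join per-speaker token lists into one target, ordered by start time.

    utterances: list of (start_time_sec, tokens) pairs, one per speaker.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    serialized = []
    for i, (_, tokens) in enumerate(ordered):
        if i > 0:
            serialized.append(SC)
        serialized.extend(tokens)
    return serialized


def _logsumexp(values):
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_nll(log_probs, target, blank=0):
    """Standard CTC negative log-likelihood via the forward algorithm.

    log_probs: T x V nested list of per-frame log-probabilities.
    target:    list of label ids (no blanks).
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [-math.inf] * S
        for s in range(S):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = _logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    final = [alpha[-1]] + ([alpha[-2]] if S > 1 else [])
    return -_logsumexp(final)


# Two overlapping speakers; the later-starting speaker comes second.
target_tokens = serialize_sot([(1.2, ["good", "morning"]), (0.0, ["hello", "there"])])
print(target_tokens)  # ['hello', 'there', '<sc>', 'good', 'morning']

# Toy vocabulary and uniform frame posteriors, purely illustrative.
vocab = {"<blank>": 0, "hello": 1, "there": 2, SC: 3, "good": 4, "morning": 5}
target_ids = [vocab[tok] for tok in target_tokens]
uniform = [[math.log(1.0 / len(vocab))] * len(vocab) for _ in range(8)]
print(ctc_nll(uniform, target_ids))
```

Plain CTC sums over every monotonic alignment of the serialized target with equal weight. Per the abstract, SACTC instead constrains where each speaker's tokens may fall in time, which a Bayes risk formulation would express by weighting alignments differently.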

Code & Models

Repositories

kjw11/speaker-aware-ctc
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques