Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

Desh Raj; Pavel Denisov; Zhuo Chen; Hakan Erdogan; Zili Huang; Maokui He; Shinji Watanabe; Jun Du; Takuya Yoshioka; Yi Luo; Naoyuki Kanda; Jinyu Li; Scott Wisdom; John R. Hershey

arXiv:2011.02014·eess.AS·November 5, 2020

TL;DR

This paper presents a modular system that integrates speech separation, diarization, and recognition for multi-speaker meeting transcription. The system handles overlapping speech effectively and reaches a WER close to that obtained on non-overlapping speech.

Contribution

The paper introduces an end-to-end modular pipeline that chains independently trained separation, diarization, and recognition modules, in that order, and analyzes how each stage affects multi-speaker transcription accuracy.

Findings

01

The separation module significantly improves diarization and recognition performance.

02

The best system achieves a speaker-attributed WER of 12.7%.

03

The system effectively mitigates the problems that overlapping speech causes for diarization and ASR.

Abstract

Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this paper, we propose an end-to-end modular system for the LibriCSS meeting data, which combines independently trained separation, diarization, and recognition components, in that order. We study the effect of different state-of-the-art methods at each stage of the pipeline, and report results using task-specific metrics like SDR and DER, as well as downstream WER. Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a…
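
For readers unfamiliar with the downstream metric named in the abstract, here is a minimal sketch of plain word error rate (WER): the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the paper. The speaker-attributed WER reported in the findings additionally requires each word to be assigned to the correct speaker, which this sketch does not model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: word-level edit distance / number of reference words.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.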
