Investigation of Practical Aspects of Single Channel Speech Separation for ASR

Jian Wu; Zhuo Chen; Sanyuan Chen; Yu Wu; Takuya Yoshioka; Naoyuki Kanda; Shujie Liu; Jinyu Li

arXiv:2107.01922·eess.AS·July 6, 2021

Open Access

TL;DR

This paper improves single-channel speech separation as an ASR frontend by combining a two-stage training scheme with teacher-student model compression, reporting a 2.70% absolute WER reduction on LibriCSS with a lightweight model.

Contribution

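It investigates a two-stage training scheme (feature-level pretraining followed by ASR-oriented optimization with an end-to-end recognizer) and introduces a modified teacher-student learning technique for compressing the speech separation model.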

Findings

01. Achieved a 2.70% absolute WER reduction on LibriCSS.

02. Developed a lightweight model with fewer than 10M parameters.

03. Demonstrated improved performance in both utterance-wise and continuous evaluation.

Abstract

Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a suboptimal word error rate (WER). In this paper, we describe our efforts to improve the performance of a single-channel speech separation system. Specifically, we investigate a two-stage training scheme that first applies a feature-level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the model lightweight, we introduce a modified teacher-student learning technique for model compression. By combining those approaches,…
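The results on this page are stated in terms of WER and "absolute" WER reduction. As a reminder of how those quantities are usually computed, the following is a minimal Python sketch, not the authors' evaluation code: WER is taken as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and an absolute reduction is the plain difference between two WER values. The function names and the example sentences and WER values are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Absolute reduction: a plain difference, not a relative percentage."""
    return baseline_wer - improved_wer


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical WER values (as fractions), unrelated to the paper's numbers.
example_reduction = absolute_wer_reduction(0.10, 0.08)
```

Under this convention, a reduction reported in absolute terms is a difference in percentage points, so it should not be read as a relative improvement over the baseline.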

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing