Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

Bo Wu; Meng Yu; Lianwu Chen; Chao Weng; Dan Su; Dong Yu

arXiv:1910.13825·eess.AS·October 31, 2019

TL;DR

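This paper introduces a joint neural network framework for overlapped speech recognition that combines multi-channel speech extraction and acoustic modeling. It achieves a 28% word error rate reduction over a separately optimized system on AISHELL-1 and stays robust across signal-to-interference ratios and angle differences between overlapping speakers.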

Contribution

It presents a novel end-to-end joint optimization of multi-channel speech extraction and acoustic modeling without relying on traditional mel-filterbank features.

Findings

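01 28% word error rate reduction (WERR) over a separately optimized system on AISHELL-1

02 Consistent robustness to signal-to-interference ratio (SIR) and to the angle difference between overlapping speakers

03 Improved recognition, with a simplified structure, when the FBANK extraction layer is replaced by a learnable feature projection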
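As a rough illustration of how a WERR figure like the one above is computed, the sketch below assumes WERR means the relative reduction in word error rate against the baseline system. The WER values are hypothetical placeholders chosen to give 28%; they are not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction of new_wer against baseline_wer.

    Assumption: WERR is defined as (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs in percent, NOT taken from the paper.
separate_system_wer = 25.0
joint_system_wer = 18.0

reduction = werr(separate_system_wer, joint_system_wer)
print(f"WERR: {reduction:.0%}")
```

Under this definition, a 28% WERR means the joint system makes 28% fewer word errors than the separately optimized one, not that the absolute WER dropped by 28 points.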

Abstract

We propose an end-to-end joint optimization framework of a multi-channel neural speech extraction and deep acoustic model without mel-filterbank (FBANK) extraction for overlapped speech recognition. First, based on a multi-channel convolutional TasNet with STFT kernel, we unify the multi-channel target speech enhancement front-end network and a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model (AM) with the FBANK extraction layer to build a hybrid neural network, which is thus jointly updated only by the recognition loss. The proposed framework achieves 28% word error rate reduction (WERR) over a separately optimized system on AISHELL-1 and shows consistent robustness to signal to interference ratio (SIR) and angle difference between overlapping speakers. Next, a further exploration shows that the speech recognition is improved…
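The contrast between an FBANK extraction layer and a learnable feature projection can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes the FBANK layer is a fixed triangular mel filterbank applied to a power spectrum followed by a log, and that the learnable projection is a weight matrix of the same shape that training would update. The helper names, sizes, and random initialization are all hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        # Each filter is a triangle peaking at `mid`, zero outside [lo, hi].
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_projection(power_spec, weights, eps=1e-6):
    """Project (frames, bins) power spectra with (n_feats, bins) weights, then log."""
    return np.log(power_spec @ weights.T + eps)


rng = np.random.default_rng(0)
n_fft, sample_rate, n_mels = 512, 16000, 40  # hypothetical sizes
power_spec = rng.random((100, n_fft // 2 + 1))  # stand-in for an STFT power spectrum

# Fixed FBANK layer: the weights are the mel filterbank and are never updated.
fixed_w = mel_filterbank(n_mels, n_fft, sample_rate)
fbank = log_projection(power_spec, fixed_w)

# Learnable projection (assumed form): same shape, but free parameters that
# the recognition loss would update during joint training.
learned_w = np.abs(rng.normal(scale=0.1, size=fixed_w.shape))
learned_feats = log_projection(power_spec, learned_w)

print(fbank.shape, learned_feats.shape)
```

Because both variants are plain matrix operations, either one can sit between the enhancement front-end and the acoustic model and pass gradients from the recognition loss back to the front-end. The only difference in this sketch is whether the projection weights themselves are trained.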

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing