Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation
Bo Wu, Meng Yu, Lianwu Chen, Chao Weng, Dan Su, Dong Yu

TL;DR
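This paper introduces a joint neural network framework for overlapped speech recognition that combines multi-channel speech extraction and acoustic modeling. It achieves a 28% word error rate reduction over a separately optimized system on AISHELL-1 and stays robust to changes in signal-to-interference ratio (SIR) and in the angle difference between overlapping speakers.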
Contribution
It presents a novel end-to-end joint optimization of multi-channel speech extraction and acoustic modeling without relying on traditional mel-filterbank features.
Findings
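28% word error rate reduction (WERR) over a separately optimized system on AISHELL-1
Consistent robustness to signal-to-interference ratio (SIR) and to the angle difference between overlapping speakers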
Improved recognition with learnable feature projection
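For readers unfamiliar with the metric, the sketch below shows how a word error rate reduction is commonly computed, assuming the usual convention that WERR is the reduction relative to the baseline WER. The function name `werr` and the input WERs are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction, as a fraction of the baseline WER.

    Assumes the common convention WERR = (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) for illustration only; not taken from the paper.
example = werr(baseline_wer=20.0, new_wer=15.0)
print(f"WERR = {example:.0%}")
```

Under this convention, a 28% WERR means the jointly optimized system makes 28% fewer word errors than the separately optimized baseline, whatever the absolute WER of that baseline is.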
Abstract
We propose an end-to-end joint optimization framework of a multi-channel neural speech extraction and deep acoustic model without mel-filterbank (FBANK) extraction for overlapped speech recognition. First, based on a multi-channel convolutional TasNet with STFT kernel, we unify the multi-channel target speech enhancement front-end network and a convolutional, long short-term memory and fully connected deep neural network (CLDNN) based acoustic model (AM) with the FBANK extraction layer to build a hybrid neural network, which is thus jointly updated only by the recognition loss. The proposed framework achieves 28% word error rate reduction (WERR) over a separately optimized system on AISHELL-1 and shows consistent robustness to signal to interference ratio (SIR) and angle difference between overlapping speakers. Next, a further exploration shows that the speech recognition is improved…
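The phrase "convolutional TasNet with STFT kernel" can be read as a 1-D convolutional encoder whose filters are fixed to a windowed Fourier basis, which keeps the STFT inside the network so that gradients from the recognition loss can flow through it. The following is a minimal single-channel sketch of that general idea, not the authors' implementation: the frame length, hop size, Hann window, and all function names are assumptions made for illustration.

```python
import cmath
import math


def stft_kernels(frame_len):
    """Fixed real and imaginary kernels: a Hann-windowed DFT basis (assumed window)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    real = [[window[n] * math.cos(2 * math.pi * k * n / frame_len) for n in range(frame_len)]
            for k in range(n_bins)]
    imag = [[-window[n] * math.sin(2 * math.pi * k * n / frame_len) for n in range(frame_len)]
            for k in range(n_bins)]
    return real, imag


def conv1d(signal, kernels, hop):
    """Strided 1-D correlation: one output row per frame, one column per kernel."""
    frame_len = len(kernels[0])
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [[sum(w * signal[s + n] for n, w in enumerate(kernel)) for kernel in kernels]
            for s in starts]


def stft_via_conv(signal, frame_len=8, hop=4):
    """STFT expressed as two strided convolutions (real and imaginary parts)."""
    real_k, imag_k = stft_kernels(frame_len)
    real = conv1d(signal, real_k, hop)
    imag = conv1d(signal, imag_k, hop)
    return [[complex(r, i) for r, i in zip(fr, fi)] for fr, fi in zip(real, imag)]


# Toy input: a sinusoid that falls exactly on bin 1 of an 8-point DFT.
signal = [math.sin(2 * math.pi * 0.125 * t) for t in range(32)]
spec = stft_via_conv(signal)

# Sanity check: the first frame should match a directly computed windowed DFT.
N = 8
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
reference = [sum(hann[n] * signal[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
             for k in range(N // 2 + 1)]
max_err = max(abs(a - b) for a, b in zip(spec[0], reference))
print(len(spec), len(spec[0]), max_err < 1e-9)
```

In a trainable model, these kernels would typically be the weights of a convolution layer applied to each microphone channel, which is what allows the front end and the acoustic model to be updated together.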
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
