End-to-End Multi-speaker Speech Recognition with Transformer
Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

TL;DR
This paper introduces Transformer-based end-to-end models for multi-speaker speech recognition that replace the RNN components and add dereverberation preprocessing, yielding relative word error rate (WER) reductions of up to 40.9% in single-channel and 25.6% in multi-channel scenarios.
Contribution
It is the first to apply Transformer architectures to multi-speaker speech recognition with multi-channel processing and dereverberation, demonstrating substantial performance improvements.
Findings
Achieved up to 40.9% relative WER reduction in single-channel tasks.
Achieved up to 25.6% relative WER reduction in multi-channel tasks.
Effective handling of reverberated signals with WPE preprocessing.
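The percentages above are relative reductions, meaning the drop in WER is expressed as a fraction of the baseline WER rather than as an absolute difference. A minimal sketch of that arithmetic follows; the function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline WER of 20.0 and a new WER of 15.0.
reduction = relative_wer_reduction(20.0, 15.0)
print(f"{reduction:.1f}% relative reduction")
```

Note that a relative reduction can look large even when the absolute WER difference is small, so the baseline WER is needed to interpret it.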
Abstract
Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus…
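The idea of restricting self-attention to a segment can be illustrated with a small sketch. The abstract does not say how segments are defined, so this example assumes a symmetric window of `half_width` frames around each position. It is also simplified to a single head with no learned query, key, or value projections. All names here are hypothetical, and this is not the authors' implementation.

```python
import math


def segment_mask(num_frames, half_width):
    """mask[i][j] is True when frame i may attend to frame j."""
    return [[abs(i - j) <= half_width for j in range(num_frames)]
            for i in range(num_frames)]


def restricted_self_attention(frames, half_width):
    """Single-head scaled dot-product self-attention over a local window.

    frames: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights), where weights[i][j] is 0.0 outside the window.
    """
    num_frames, dim = len(frames), len(frames[0])
    mask = segment_mask(num_frames, half_width)
    outputs, all_weights = [], []
    for i in range(num_frames):
        # Compute scores only for positions inside the allowed segment.
        scores = {
            j: sum(a * b for a, b in zip(frames[i], frames[j])) / math.sqrt(dim)
            for j in range(num_frames) if mask[i][j]
        }
        # Numerically stable softmax over the allowed positions.
        peak = max(scores.values())
        exps = {j: math.exp(s - peak) for j, s in scores.items()}
        total = sum(exps.values())
        weights = [exps.get(j, 0.0) / total for j in range(num_frames)]
        # Weighted sum of the frames (used directly as values).
        outputs.append([sum(weights[j] * frames[j][k] for j in range(num_frames))
                        for k in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights


frames = [[float(t), 1.0] for t in range(6)]
outputs, weights = restricted_self_attention(frames, half_width=1)
```

With a window of fixed width, each frame scores only a constant number of neighbours, so the number of scored pairs grows linearly with sequence length instead of quadratically. This is the kind of computational saving the abstract refers to.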
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
