End-to-End Multi-speaker Speech Recognition with Transformer

Xuankai Chang; Wangyou Zhang; Yanmin Qian; Jonathan Le Roux; Shinji Watanabe

arXiv:2002.03921 · eess.AS · February 14, 2020 · 5 cites

Open Access

TL;DR

This paper introduces Transformer-based end-to-end models for multi-speaker speech recognition, replacing RNN components and incorporating dereverberation preprocessing, resulting in significant WER reductions in both single-channel and multi-channel scenarios.

Contribution

It is the first to apply Transformer architectures to multi-speaker speech recognition with multi-channel processing and dereverberation, demonstrating substantial performance improvements.

Findings

01. Achieved up to 40.9% relative WER reduction in single-channel tasks.

02. Achieved up to 25.6% relative WER reduction in multi-channel tasks.

03. Effective handling of reverberated signals with WPE preprocessing.
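
The relative reductions quoted above are presumably computed in the standard way, as the drop in word error rate divided by the baseline word error rate. The symbols below are notation assumed here for illustration, not taken from the paper, with the baseline taken to be the RNN-based system.

```latex
\[
\text{relative WER reduction}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}
         {\mathrm{WER}_{\text{baseline}}} \times 100\%
\]
```

Under this definition, a 40.9% relative reduction means the new system's WER is 59.1% of the baseline's, whatever the absolute values are.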

Abstract

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus…
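
The segment-restricted self-attention mentioned in the abstract can be illustrated with a banded attention mask. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the `left`/`right` window parameters, and the choice of a window centered on each frame are all assumptions, since the abstract does not specify how segments are defined.

```python
import numpy as np


def segment_attention_mask(num_frames: int, left: int, right: int) -> np.ndarray:
    """Return a boolean mask where entry [t, s] is True if frame t may attend to frame s.

    Assumption: each frame attends to `left` past frames, itself, and `right`
    future frames. The paper may define its segments differently.
    """
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # offset[t, s] = s - t
    return (offset >= -left) & (offset <= right)


def restricted_self_attention(q, k, v, mask):
    """Scaled dot-product attention with out-of-segment positions masked out."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Masked positions get -inf so they receive zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 4))  # 6 frames, 4-dimensional features
    mask = segment_attention_mask(6, left=1, right=1)
    print(restricted_self_attention(x, x, x, mask).shape)
```

This sketch still builds the full score matrix and only masks it, so it shows the restriction itself rather than the savings. To actually reduce computation, an implementation would compute only the in-window scores, which lowers the cost from quadratic in sequence length to roughly linear for a fixed window size.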

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax