Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
Timo Lohrenz, Zhengyang Li, Tim Fingscheidt

TL;DR
This paper explores multi-encoder learning and stream fusion for transformer-based end-to-end speech recognition, reporting lower word error rates (WER) on WSJ and Librispeech without increasing model complexity at inference.
Contribution
Introduces a multi-encoder learning method and a stream fusion approach for transformers, reporting state-of-the-art results on WSJ.
Findings
Consistent WER reduction on WSJ and Librispeech datasets.
State-of-the-art performance with 19% relative WER improvement on WSJ.
Effective fusion of magnitude and phase features without extra runtime.
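As a reading aid for the "relative WER improvement" figure above, here is a minimal sketch of how a relative WER reduction is usually computed. The function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Made-up numbers for illustration only; they are NOT results from the paper.
example = relative_wer_reduction(baseline_wer=10.0, new_wer=8.0)
print(f"{example:.0%} relative reduction")
```

A relative reduction is measured against the baseline error rate, so it is not the same as the absolute difference in WER percentage points.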
Abstract
Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving to achieve optimal fusion by investigating different fusion levels in an example single-microphone setting with fusion of standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training. Employing then only the magnitude feature encoder in inference, we are able to show consistent improvement on Wall Street Journal (WSJ) with language model and on Librispeech, without increase…
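The multi-encoder learning idea in the abstract (a weighted combination of two encoder-decoder attention outputs during training, with only the magnitude encoder used at inference) might be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses single-head attention without learned projections, and the names `cross_attention`, `multi_encoder_attention`, and the weight `alpha` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention of decoder queries over
    one encoder's output (assumption: no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)  # (num_queries, num_frames)
    return softmax(scores, axis=-1) @ memory  # (num_queries, d)


def multi_encoder_attention(queries, mem_mag, mem_phase, alpha=0.5, training=True):
    """Weighted combination of two attention outputs during training;
    only the magnitude stream is used at inference.
    `alpha` is a hypothetical fusion weight."""
    out_mag = cross_attention(queries, mem_mag)
    if not training:
        return out_mag  # the phase encoder is not needed at inference
    out_phase = cross_attention(queries, mem_phase)
    return alpha * out_mag + (1.0 - alpha) * out_phase


# Random stand-ins for decoder queries and the two encoder outputs.
rng = np.random.default_rng(0)
queries = rng.standard_normal((3, 8))
mem_mag = rng.standard_normal((5, 8))
mem_phase = rng.standard_normal((7, 8))

train_out = multi_encoder_attention(queries, mem_mag, mem_phase, alpha=0.7)
infer_out = multi_encoder_attention(queries, mem_mag, mem_phase, training=False)
print(train_out.shape, infer_out.shape)
```

Because the inference path touches only the magnitude stream, the phase encoder can be dropped after training, which is consistent with the abstract's point that inference uses only the magnitude feature encoder.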
Taxonomy
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
