Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Timo Lohrenz; Zhengyang Li; Tim Fingscheidt

arXiv:2104.00120·eess.AS·July 15, 2021

TL;DR

This paper explores novel multi-encoder fusion techniques for transformer-based end-to-end speech recognition, demonstrating improved performance on WSJ and Librispeech without increasing model complexity.

Contribution

Introduces a multi-encoder learning method and a stream fusion approach for transformers, achieving state-of-the-art results on WSJ.

Findings

01

Consistent WER reduction on WSJ and Librispeech datasets.

02

State-of-the-art performance with 19% relative WER improvement on WSJ.

03

Effective fusion of magnitude and phase features without extra runtime.
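
As a reading aid for the "relative WER improvement" figure above, here is a minimal sketch of how a relative reduction is usually computed. The function name and the example numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical illustration (not figures from the paper):
# a drop from 10.0% to 8.1% WER is a relative reduction of about 19%.
example = relative_wer_reduction(10.0, 8.1)
print(f"relative reduction: {example:.1%}")
```

A relative figure is measured against the baseline, so the same 19% can correspond to very different absolute WER changes depending on how low the baseline already is.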

Abstract

Stream fusion, also known as system combination, is a common technique in automatic speech recognition for traditional hybrid hidden Markov model approaches, yet mostly unexplored for modern deep neural network end-to-end model architectures. Here, we investigate various fusion techniques for the all-attention-based encoder-decoder architecture known as the transformer, striving to achieve optimal fusion by investigating different fusion levels in an example single-microphone setting with fusion of standard magnitude and phase features. We introduce a novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training. Employing then only the magnitude feature encoder in inference, we are able to show consistent improvement on Wall Street Journal (WSJ) with language model and on Librispeech, without increase…
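
The training-only weighted combination described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it uses single-head scaled dot-product attention without learned projections, and the names `cross_attention`, `fused_cross_attention`, `alpha`, and `training` are assumptions introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention of decoder queries over
    encoder outputs (a simplification of multi-head attention)."""
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ memory


def fused_cross_attention(queries, mag_memory, phase_memory,
                          alpha=0.5, training=True):
    """Weighted combination of two encoder-decoder attention outputs.

    alpha is a hypothetical fusion weight. In training both streams
    contribute; in inference only the magnitude stream is used, so the
    phase encoder adds no runtime cost.
    """
    mag_out = cross_attention(queries, mag_memory)
    if not training:
        return mag_out
    phase_out = cross_attention(queries, phase_memory)
    return alpha * mag_out + (1.0 - alpha) * phase_out


# Toy shapes: 3 decoder positions, model dimension 8; the two encoder
# outputs may have different lengths.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))
mag_memory = rng.normal(size=(5, 8))
phase_memory = rng.normal(size=(7, 8))

train_out = fused_cross_attention(queries, mag_memory, phase_memory,
                                  alpha=0.7, training=True)
infer_out = fused_cross_attention(queries, mag_memory, phase_memory,
                                  training=False)
```

Because the inference branch returns before the phase stream is touched, the deployed model in this sketch has the same cost as a single-encoder model, which matches the abstract's statement that only the magnitude feature encoder is employed in inference.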

Taxonomy

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention