TL;DR
This paper introduces relaxed attention, a simple method that improves end-to-end speech recognition models by gradually injecting a uniform distribution into the encoder-decoder attention weights during training. On WSJ it reaches a word error rate (WER) of 3.65%, a new state of the art for transformer-based end-to-end models.
Contribution
The paper proposes a straightforward, easily implementable relaxed attention technique that enhances transformer-based ASR performance across multiple architectures and datasets.
Findings
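Relaxed attention lowers WER on both the WSJ and Librispeech tasks.
Transformers trained with relaxed attention consistently outperform the standard baselines when decoding with external language models.
On WSJ, the method achieves a WER of 3.65%, a new state of the art for transformer-based end-to-end speech recognition.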
Abstract
Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art…
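The abstract describes relaxed attention as blending a uniform distribution into the encoder-decoder attention weights during training. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `relax_attention`, the coefficient name `gamma`, and the convex-combination form are assumptions, and padding masks and any schedule for the relaxation strength are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def relax_attention(weights, gamma, training=True):
    """Blend attention weights with a uniform distribution (hypothetical helper).

    weights: array of shape (..., num_frames); each row sums to 1.
    gamma:   assumed relaxation coefficient in [0, 1]; 0 leaves weights unchanged.
    The blend is applied only in training, matching the abstract's description.
    """
    if not training:
        return weights
    num_frames = weights.shape[-1]
    # Convex combination with the uniform distribution 1 / num_frames.
    return (1.0 - gamma) * weights + gamma / num_frames


# Toy example: two decoder steps attending over four encoder frames.
scores = np.array([[4.0, 1.0, 0.0, -1.0],
                   [0.0, 0.0, 3.0, 0.0]])
weights = softmax(scores)
relaxed = relax_attention(weights, gamma=0.25)
```

Because the result is a convex combination of two distributions, each row of `relaxed` still sums to 1, while its peak is lower than the peak of `weights`. That flattening is what counteracts overconfident attention. With `training=False` the weights pass through unchanged.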
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
