Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition

Timo Lohrenz; Patrick Schwarz; Zhengyang Li; Tim Fingscheidt

arXiv:2107.01275 · eess.AS · December 16, 2021

TL;DR

This paper introduces relaxed attention, a simple method that improves end-to-end speech recognition models by gradually injecting a uniform distribution into the encoder-decoder attention weights during training. On WSJ it reaches a word error rate (WER) of 3.65%, a new benchmark for transformer-based end-to-end ASR.

Contribution

The paper proposes relaxed attention, a technique that takes about two lines of code to implement and that improves attention-based encoder-decoder ASR models across several architectures and two tasks (WSJ and Librispeech).

Findings

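01 Relaxed attention improves WER on the WSJ and Librispeech tasks.

02 Transformers trained with relaxed attention consistently outperform the standard baseline models when decoding with external language models.

03 On WSJ, the method sets a new benchmark for transformer-based end-to-end ASR with a WER of 3.65%.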
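The findings above are all reported as word error rate. For readers unfamiliar with the metric, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. This is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A WER of 3.65% therefore corresponds to fewer than four word errors per hundred reference words.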

Abstract

Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art…
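The abstract describes relaxed attention as injecting a uniform distribution into the encoder-decoder attention weights during training. One plausible reading is a convex blend of the softmax weights with a uniform distribution over the encoder positions, applied only in training mode. The sketch below illustrates that reading in dependency-free Python. The coefficient name `gamma`, its value, the helper names, and the exact blending form are assumptions, not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relax_attention(weights, gamma, training=True):
    """Blend attention weights with a uniform distribution (assumed form).

    weights: attention weights over T encoder positions, summing to 1.
    gamma:   hypothetical relaxation coefficient in [0, 1].
    """
    if not training:
        # Assumption: relaxation is a training-time regularizer only.
        return list(weights)
    uniform = 1.0 / len(weights)
    return [(1.0 - gamma) * w + gamma * uniform for w in weights]


scores = [4.0, 1.0, 0.5, 0.2]   # illustrative scores for one decoder step
attn = softmax(scores)
relaxed = relax_attention(attn, gamma=0.25)
```

The blended weights still sum to one, and the largest weight shrinks toward 1/T, which is how such a blend could counteract overconfident attention. In a tensor framework the same blend would be one or two lines applied to the attention matrix after the softmax.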

Code & Models

Repositories

freewym/espresso · PyTorch · Official

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax