Relaxed Attention for Transformer Models
Timo Lohrenz, Björn Möller, Zhengyang Li, Tim Fingscheidt

TL;DR
This paper introduces relaxed attention, a simple smoothing technique for transformer models that improves regularization, enhances external language model integration, and achieves state-of-the-art results in lip-reading and machine translation tasks.
Contribution
It proposes relaxed attention as a novel method to regularize transformers and facilitate external language model integration, leading to improved performance.
Findings
Achieved 26.31% WER on LRS3 lip-reading benchmark.
Attained 37.67 BLEU score on IWSLT14 translation task.
Demonstrated improved regularization and external LM support.
Abstract
The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, which complicates the integration of external language models. In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: First, relaxed attention provides regularization when applied to the self-attention layers in the encoder. Second, we show that it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder. We demonstrate the benefit of relaxed attention across several tasks with clear improvement in…
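The abstract describes relaxed attention only as a smoothing of the attention weights. One plausible reading, sketched below, blends the softmax weights with a uniform distribution over the keys using a relaxation coefficient. The function name `relaxed_attention`, the parameter `gamma`, the exact blending formula, and the choice to apply it only during training are assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relaxed_attention(q, k, v, gamma=0.1, training=True):
    """Scaled dot-product attention with smoothed attention weights.

    q: (..., num_queries, d), k and v: (..., num_keys, d).
    gamma is a hypothetical relaxation coefficient in [0, 1]:
    0 gives standard attention, 1 gives uniform attention.
    """
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    if training and gamma > 0.0:
        # Assumed smoothing: mix each row with a uniform distribution
        # over the keys. Rows still sum to 1 after the blend.
        num_keys = weights.shape[-1]
        weights = (1.0 - gamma) * weights + gamma / num_keys
    return weights @ v, weights
```

Under this reading, the same function would cover both uses named in the abstract: passing encoder states as `q`, `k`, and `v` relaxes self-attention, while passing decoder states as `q` and encoder outputs as `k` and `v` relaxes cross attention. Every key receives at least `gamma / num_keys` weight, so no single position can fully dominate a query.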
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
