Weak-Attention Suppression For Transformer Based Speech Recognition

Yangyang Shi; Yongqiang Wang; Chunyang Wu; Christian Fuegen; Frank Zhang; Duc Le; Ching-Feng Yeh; Michael L. Seltzer

arXiv:2005.09137·eess.AS·May 20, 2020·5 cites

Open Access

TL;DR

This paper introduces Weak-Attention Suppression (WAS), a dynamic sparsity method for transformers in speech recognition, improving accuracy by reducing attention to redundant frames and emphasizing critical acoustic information.

Contribution

WAS is a novel technique that induces sparsity in attention, leading to state-of-the-art streaming speech recognition performance.

Findings

01

WAS reduces WER by 10% on LibriSpeech test-clean and 5% on test-other for streamable transformers.

02

WAS suppresses attention to non-critical frames, improving model focus.

03

WAS outperforms strong transformer baselines on the LibriSpeech benchmark.

Abstract

Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, unlike text units, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak. This suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines. On the widely used LibriSpeech benchmark, our proposed method reduces WER by 10% on test-clean and 5% on test-other for streamable transformers, resulting in a new state-of-the-art among streaming models. Further analysis shows that WAS learns to suppress attention of…
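The abstract describes WAS only as dynamically inducing sparsity in attention probabilities and does not give the rule here. The following is a minimal sketch of one way such suppression could work for a single query, not the authors' implementation. It assumes a per-query threshold of mean minus gamma times the standard deviation, and renormalization by division; the threshold rule, the `gamma` parameter, and the function names are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_weak_attention(probs, gamma=0.5):
    """Zero out weak attention probabilities and renormalize the rest.

    Assumption (not taken from the abstract): a probability counts as
    "weak" when it falls below mean - gamma * std of this query's
    attention distribution. gamma must be non-negative so that the
    largest probability always survives.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n)
    threshold = mean - gamma * std
    # Keep only the probabilities at or above the dynamic threshold.
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# One query attending over five frames: two strong scores, three weak ones.
scores = [4.0, 3.5, 0.0, -1.0, 0.2]
probs = softmax(scores)
sparse = suppress_weak_attention(probs, gamma=0.5)
print([round(p, 3) for p in sparse])
```

Because the threshold is computed from each query's own distribution, the number of suppressed frames varies from query to query, which is one reading of "dynamically" in the abstract.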


Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding