Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
Chendong Zhao, Jianzong Wang, Wenqi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

TL;DR
This paper integrates adaptive sparse attention and monotonic attention into Transformer-based speech recognition to better support streaming (online) recognition and monotonic alignment modeling, and reports improved recognition performance.
Contribution
It proposes a novel integration of sparse and monotonic attention into Transformer ASR to address limitations in streaming and alignment modeling.
Findings
Improved recognition accuracy on benchmark datasets.
Effective modeling of monotonic alignments.
Enhanced attention mechanism for online ASR.
Abstract
The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-normalized attention mechanism makes it impossible to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments in different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme to enable each self-attention structure to fit the corresponding head better. The monotonic attention deploys regularization to prune redundant heads for the multi-head…
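The contrast the abstract draws between softmax and sparse attention can be illustrated with a small sketch. The abstract does not specify the learned sparsity scheme here, so the code below uses sparsemax, a well-known sparse alternative to softmax, as a stand-in. The function names and the example scores are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(z):
    """Dense normalization: every position receives a strictly positive weight."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def sparsemax(z):
    """Sparse normalization (stand-in, not the paper's learned scheme).

    Projects the scores onto the probability simplex, so low-scoring
    positions can receive exactly zero weight.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum    # positions that stay non-zero
    k_max = k[support][-1]                 # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)


# Hypothetical attention scores of one query over three frames.
scores = np.array([1.0, 0.8, 0.1])
print("softmax  :", softmax(scores))
print("sparsemax:", sparsemax(scores))
```

With these scores, softmax spreads some weight over all three frames, while sparsemax assigns the lowest-scoring frame a weight of exactly zero and still sums to one. This is the sense in which a sparse normalization can single out the informative frames.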
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
