Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Chendong Zhao; Jianzong Wang; Wen qi Wei; Xiaoyang Qu; Haoqian Wang; Jing Xiao

arXiv:2209.15176·cs.CL·October 3, 2022

Open Access

TL;DR

This paper introduces an adaptive sparse and monotonic attention mechanism for Transformer-based speech recognition, enhancing online processing and alignment modeling, leading to improved recognition performance.

Contribution

It proposes a novel integration of sparse and monotonic attention into Transformer ASR to address limitations in streaming and alignment modeling.

Findings

01 Improved recognition accuracy on benchmark datasets.

02 Effective modeling of monotonic alignments.

03 Enhanced attention mechanism for online ASR.

Abstract

The Transformer architecture model, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied for streaming or online ASR. For self-attention in Transformer ASR, the softmax normalization function-based attention mechanism makes it impossible to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments in different heads. To overcome these two limits, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme to enable each self-attention structure to fit the corresponding head better. The monotonic attention deploys regularization to prune redundant heads for the multi-head…
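As a rough illustration of why a sparse normalizer can highlight important frames where softmax cannot, the sketch below compares softmax with sparsemax on one row of attention scores. The abstract does not name the sparse transformation the authors use, so sparsemax is an assumption chosen only because it is a well-known sparse alternative to softmax; it has a fixed sparsity rule rather than the learned, per-head scheme the abstract describes. The function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Standard softmax: every position receives a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex.

    Illustrative stand-in for a sparse attention normalizer; not taken from the paper.
    Low-scoring positions receive exactly zero weight.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The support condition holds for a prefix of the sorted scores;
        # the threshold from the largest such k is the one that is kept.
        if 1 + k * z > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical attention scores for one query over three encoder frames.
scores = [1.0, 0.8, 0.1]
dense = softmax(scores)     # all three frames keep some weight
sparse = sparsemax(scores)  # the lowest-scoring frame is zeroed out
```

With these scores, softmax spreads weight over all three frames, while sparsemax assigns the third frame exactly zero and splits the mass between the first two. A learned sparsity scheme would, in addition, let each head decide how aggressively to prune.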

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization