Head-synchronous Decoding for Transformer-based Streaming ASR
Mohan Li, Catalin Zorila, Rama Doddipatla

TL;DR
This paper introduces a head-synchronous version of the DACS algorithm (HS-DACS) for Transformer-based streaming ASR, in which all attention heads jointly detect the attention boundary. This improves decoding stability and accuracy over vanilla DACS on WSJ, AIShell-1, and Librispeech.
Contribution
The paper proposes a novel head-synchronous DACS method that enhances online Transformer ASR by synchronizing attention heads, leading to better stability and performance.
Findings
HS-DACS outperforms vanilla DACS on WSJ, AIShell-1, and Librispeech.
The method achieves state-of-the-art results in streaming ASR.
HS-DACS reduces decoding cost compared to vanilla DACS.
Abstract
Online Transformer-based automatic speech recognition (ASR) systems have been extensively studied due to the increasing demand for streaming applications. The recently proposed Decoder-end Adaptive Computation Steps (DACS) algorithm for online Transformer ASR was shown to achieve state-of-the-art performance and outperform other existing methods. However, like any other online approach, the DACS-based attention heads in each of the Transformer decoder layers operate independently (or asynchronously), leading to divergent attending positions. Since DACS employs a truncation threshold to determine the halting position, some of the attention weights are cut off prematurely, which might impact the stability and precision of decoding. To overcome these issues, here we propose a head-synchronous (HS) version of the DACS algorithm, where the boundary of attention is jointly detected by all the DACS heads…
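The contrast between independent and joint halting can be illustrated with a small sketch. It is not the authors' implementation: the excerpt above does not give the exact halting rule, so the code assumes each head emits a per-frame halting probability, that a vanilla head halts once its own running sum exceeds a threshold of 1.0, and that the head-synchronous variant halts once the running sum over all heads exceeds the number of heads. The function names and the example probabilities are hypothetical.

```python
from typing import List, Sequence


def vanilla_halting_positions(
    probs: Sequence[Sequence[float]], threshold: float = 1.0
) -> List[int]:
    """Halt each head independently (assumed vanilla behaviour).

    probs[h][t] is the halting probability of head h at encoder frame t.
    A head halts at the first frame where its own cumulative sum exceeds
    the threshold; if that never happens, it falls back to the last frame.
    """
    positions = []
    for head in probs:
        total = 0.0
        halt = len(head) - 1  # fallback: threshold never reached
        for t, p in enumerate(head):
            total += p
            if total > threshold:
                halt = t
                break
        positions.append(halt)
    return positions


def head_synchronous_halting_position(
    probs: Sequence[Sequence[float]], threshold_per_head: float = 1.0
) -> int:
    """Halt all heads at one shared frame (assumed joint rule).

    The halting probabilities of all heads are summed frame by frame and
    compared with a joint threshold of threshold_per_head * num_heads.
    """
    num_heads = len(probs)
    num_frames = len(probs[0])
    joint_threshold = threshold_per_head * num_heads
    total = 0.0
    for t in range(num_frames):
        total += sum(head[t] for head in probs)
        if total > joint_threshold:
            return t
    return num_frames - 1  # fallback: threshold never reached


# Hypothetical halting probabilities for two heads over four frames.
example_probs = [
    [0.8, 0.8, 0.1, 0.1],  # this head is confident early
    [0.1, 0.1, 0.5, 0.9],  # this head is confident late
]
independent = vanilla_halting_positions(example_probs)
shared = head_synchronous_halting_position(example_probs)
```

With these made-up values the independent heads stop at frames 1 and 3, while the joint rule stops both at frame 2, so neither head's attention window is cut short by its own early threshold crossing.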
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Softmax · Multi-Head Attention · Layer Normalization · Adam · Label Smoothing
