Head-synchronous Decoding for Transformer-based Streaming ASR

Mohan Li; Catalin Zorila; Rama Doddipatla

arXiv:2104.12631 · eess.AS · April 27, 2021 · ICASSP · 1 citation

TL;DR

This paper introduces a head-synchronous version of the DACS algorithm (HS-DACS) for Transformer-based streaming ASR. Synchronizing the attention heads improves decoding stability and accuracy, and HS-DACS outperforms vanilla DACS on WSJ, AIShell-1, and Librispeech.

Contribution

The paper proposes a head-synchronous DACS method for online Transformer ASR in which all DACS heads jointly detect the attention boundary, rather than each head halting independently, leading to more stable and more accurate decoding.

Findings

01. HS-DACS outperforms vanilla DACS on WSJ, AIShell-1, and Librispeech.

02. The method achieves state-of-the-art results in streaming ASR.

03. HS-DACS reduces decoding cost compared to vanilla DACS.

Abstract

Online Transformer-based automatic speech recognition (ASR) systems have been extensively studied due to the increasing demand for streaming applications. The recently proposed Decoder-end Adaptive Computation Steps (DACS) algorithm for online Transformer ASR was shown to achieve state-of-the-art performance and outperform other existing methods. However, like any other online approach, the DACS-based attention heads in each of the Transformer decoder layers operate independently (or asynchronously), which leads to divergent attending positions. Since DACS employs a truncation threshold to determine the halting position, some of the attention weights are cut off prematurely, which might impair the stability and precision of decoding. To overcome these issues, we propose a head-synchronous (HS) version of the DACS algorithm, where the boundary of attention is jointly detected by all the DACS heads…
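
The contrast between independent and head-synchronous halting can be illustrated with a small sketch. This is not the authors' implementation: the abstract excerpt only says that the attention boundary is detected jointly by all heads. The sketch assumes that each head emits a halting probability per encoder step, that a head halts once its accumulated probability reaches a threshold, and that the synchronous variant sums the probabilities of all heads and compares them against a threshold scaled by the number of heads. The function names, the `>=` comparison, and the toy probabilities are hypothetical.

```python
from typing import List


def independent_halting(probs: List[List[float]], threshold: float = 1.0) -> List[int]:
    """Per-head halting (assumed DACS-style behaviour).

    probs[h][j] is the halting probability of head h at encoder step j.
    Each head stops at the first step (1-based) where its own accumulated
    probability reaches `threshold`, or at the last step if it never does.
    """
    positions = []
    for head in probs:
        acc = 0.0
        stop = len(head)  # fall back to the last available step
        for j, p in enumerate(head, start=1):
            acc += p
            if acc >= threshold:
                stop = j
                break
        positions.append(stop)
    return positions


def synchronous_halting(probs: List[List[float]], threshold_per_head: float = 1.0) -> int:
    """Joint halting (assumed HS-DACS-style behaviour).

    The halting probabilities of all heads are summed at each step and
    accumulated; every head stops at the same step, the first one where the
    joint sum reaches threshold_per_head * number_of_heads (an assumption).
    """
    joint_threshold = threshold_per_head * len(probs)
    steps = min(len(head) for head in probs)
    acc = 0.0
    for j in range(steps):
        acc += sum(head[j] for head in probs)
        if acc >= joint_threshold:
            return j + 1
    return steps


# Toy halting probabilities for two heads over four encoder steps.
toy_probs = [
    [0.6, 0.6, 0.1, 0.1],  # this head saturates early
    [0.3, 0.3, 0.3, 0.3],  # this head saturates late
]
print(independent_halting(toy_probs))  # [2, 4]: the heads diverge
print(synchronous_halting(toy_probs))  # 3: one shared boundary
```

In this toy case the independent heads stop at different positions, while the joint rule yields a single boundary for all heads, which is the synchronization property the abstract describes.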

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Softmax · Multi-Head Attention · Layer Normalization · Adam · Label Smoothing