Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition

Niko Moritz; Takaaki Hori; Jonathan Le Roux

arXiv:2107.01269·eess.AS·July 6, 2021

TL;DR

This paper introduces the dual causal/non-causal self-attention (DCN) architecture for streaming end-to-end speech recognition. DCN improves ASR performance over restricted self-attention and is competitive with chunk-based self-attention, while allowing frame-synchronous processing.

Contribution

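The paper proposes the DCN architecture, which, unlike restricted self-attention, keeps the overall context of a deep model from growing beyond the look-ahead of a single layer. This improves streaming ASR performance over restricted self-attention while retaining frame-synchronous processing.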

Findings

01. DCN outperforms restricted self-attention in streaming ASR tasks.

02. The proposed method achieves state-of-the-art results on LibriSpeech, HKUST, and Switchboard.

03. DCN provides a good trade-off between context and latency in streaming ASR.

Abstract

Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined…
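
The context growth that the abstract attributes to restricted self-attention can be illustrated with a small sketch. The code below is not the DCN architecture and is not taken from the paper; it only composes attention masks across layers to show how the look-ahead of stacked restricted self-attention layers accumulates, while a chunk-style mask stays bounded. The function names, the toy sizes, and the particular chunk mask (full history plus the rest of the current chunk) are assumptions made for illustration.

```python
def restricted_mask(num_frames, look_ahead):
    """mask[t][s] is True when frame t may attend to frame s:
    all past frames plus `look_ahead` future frames (illustrative)."""
    return [[s <= t + look_ahead for s in range(num_frames)]
            for t in range(num_frames)]


def chunk_mask(num_frames, chunk_size):
    """Assumed chunk-style mask: frame t may attend to every frame
    up to the end of its own chunk."""
    return [[s < (t // chunk_size + 1) * chunk_size for s in range(num_frames)]
            for t in range(num_frames)]


def compose(a, b):
    """Boolean matrix product: t reaches s if t attends to some frame u
    (in a) that in turn reaches s (in b)."""
    n = len(a)
    return [[any(a[t][u] and b[u][s] for u in range(n)) for s in range(n)]
            for t in range(n)]


def effective_look_ahead(mask, num_layers, frame):
    """Largest future offset that can influence `frame` after stacking
    `num_layers` layers that all use the same mask."""
    reach = mask
    for _ in range(num_layers - 1):
        reach = compose(mask, reach)
    return max(s for s, ok in enumerate(reach[frame]) if ok) - frame


if __name__ == "__main__":
    frames = 40  # toy sequence length, long enough to avoid edge effects
    for layers in (1, 3, 6):
        restricted = effective_look_ahead(restricted_mask(frames, 2), layers, 0)
        chunked = effective_look_ahead(chunk_mask(frames, 4), layers, 0)
        print(f"layers={layers}: restricted={restricted}, chunk={chunked}")
```

With a per-layer look-ahead of 2 frames, the restricted mask reaches 2, 6, and 12 future frames after 1, 3, and 6 layers, whereas the chunk mask never looks past the end of the current chunk. According to the abstract, DCN is designed so that the overall context stays within the look-ahead of a single layer, without giving up frame-synchronous processing.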
