Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition
Niko Moritz, Takaaki Hori, Jonathan Le Roux

TL;DR
This paper introduces the dual causal/non-causal self-attention (DCN) architecture for streaming end-to-end speech recognition. DCN improves ASR performance over restricted self-attention and, unlike chunk-based self-attention, allows frame-synchronous processing.
Contribution
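The paper proposes the DCN architecture, which combines causal and non-causal self-attention so that the overall context of a deep model does not grow beyond the look-ahead of a single layer. This improves streaming ASR performance over restricted self-attention while keeping processing frame-synchronous.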
Findings
DCN outperforms restricted self-attention in streaming ASR tasks.
The proposed method achieves state-of-the-art results on LibriSpeech, HKUST, and Switchboard.
DCN provides a good trade-off between context and latency in streaming ASR.
Abstract
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it is spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined…
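The abstract's point about context growth can be illustrated with a small dependency-tracking sketch. It does not implement attention itself; it only tracks, for each frame, the furthest future input frame that its output can depend on after a stack of layers. The function names, the parameter `eps` (per-layer look-ahead in frames), and in particular the wiring of the dual causal/non-causal layer are assumptions made for illustration, inferred from the abstract rather than taken from the authors' implementation.

```python
def restricted_lookahead(num_frames, num_layers, eps):
    """Effective look-ahead per frame for stacked restricted self-attention.

    Each layer lets frame n attend to frames 0 .. n + eps of the layer below.
    """
    # reach[n] = furthest input frame that frame n depends on (layer 0: itself)
    reach = list(range(num_frames))
    for _ in range(num_layers):
        reach = [
            max(reach[: min(n + eps, num_frames - 1) + 1])
            for n in range(num_frames)
        ]
    return [r - n for n, r in enumerate(reach)]


def dual_lookahead(num_frames, num_layers, eps):
    """Effective look-ahead per frame for an ASSUMED dual-stream wiring.

    Assumption: every layer keeps a causal stream (frame n sees causal
    frames 0 .. n) and a non-causal stream (frame n sees non-causal frames
    0 .. n plus causal frames n + 1 .. n + eps of the layer below).
    """
    causal = list(range(num_frames))
    noncausal = list(range(num_frames))
    for _ in range(num_layers):
        new_causal = [max(causal[: n + 1]) for n in range(num_frames)]
        new_noncausal = []
        for n in range(num_frames):
            last = min(n + eps, num_frames - 1)
            past = max(noncausal[: n + 1])
            # Future frames are read from the causal stream only, so they
            # bring no additional look-ahead of their own.
            future = max(causal[n + 1 : last + 1], default=n)
            new_noncausal.append(max(past, future))
        causal, noncausal = new_causal, new_noncausal
    return [r - n for n, r in enumerate(noncausal)]


if __name__ == "__main__":
    frames, layers, eps = 40, 3, 2
    print("restricted:", restricted_lookahead(frames, layers, eps)[0])
    print("dual:      ", dual_lookahead(frames, layers, eps)[0])
```

Under these assumptions, the restricted stack accumulates a look-ahead of `num_layers * eps` frames, while the dual-stream stack stays at `eps` frames regardless of depth, which is the property the abstract attributes to DCN.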
