Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking
Khanh Le, Duc Chau

TL;DR
This paper introduces a streaming speech recognition method that combines Time-Shifted Contextual Attention (TSCA) with Dynamic Right Context (DRC) masking, giving chunk-based models access to future context at minimal user-perceived latency and reducing word error rate (WER) by a relative 10 to 13.9% on Librispeech.
Contribution
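The paper proposes TSCA and DRC masking to incorporate future context into chunk-based streaming speech recognition, reports a 10 to 13.9% relative WER reduction on Librispeech, and presents a streaming ASR pipeline that integrates TSCA with minimal user-perceived latency while supporting batch processing.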
Findings
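Achieved a 10 to 13.9% relative WER reduction on Librispeech by including in-context future information through TSCA.
Enabled batch processing in the streaming pipeline while keeping user-perceived latency minimal.
Demonstrated a practical streaming ASR pipeline that integrates TSCA.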
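For readers unfamiliar with the metric, a relative WER reduction compares the improvement against the baseline WER rather than reporting the absolute difference. The sketch below shows the arithmetic; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline WER of 5.0% improved to 4.5% is a 0.5 point absolute reduction,
# which is roughly a 10% relative reduction.
example = relative_wer_reduction(5.0, 4.5)
print(round(example, 1))
```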
Abstract
Chunk-based inference stands out as a popular approach in developing real-time streaming speech recognition, valued for its simplicity and efficiency. However, because it restricts the model's focus to only the history and current chunk context, it may result in performance degradation in scenarios that demand consideration of future context. Addressing this, we propose a novel approach featuring Time-Shifted Contextual Attention (TSCA) and Dynamic Right Context (DRC) masking. Our method shows a relative word error rate reduction of 10 to 13.9% on the Librispeech dataset with the inclusion of in-context future information provided by TSCA. Moreover, we present a streaming automatic speech recognition pipeline that facilitates the integration of TSCA with minimal user-perceived latency, while also enabling batch processing capability, making it practical for various applications.
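To make the limitation of chunk-based inference concrete, the following is a minimal sketch of a chunk-wise attention mask in plain Python. It is not the authors' TSCA or DRC implementation, which the abstract does not specify. The function name and the parameters `chunk_size`, `left_chunks`, and `right_frames` are assumptions chosen for illustration. With `right_frames=0` each frame sees only its history and its own chunk; a positive value lets it see a few future frames.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks, right_frames=0):
    """Build a boolean mask where mask[i][j] is True if query frame i
    may attend to key frame j.

    Assumed (hypothetical) parameters:
      chunk_size   -- number of frames per chunk
      left_chunks  -- number of past chunks visible to each frame
      right_frames -- number of future frames visible beyond the current chunk
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # First visible frame: start of the oldest allowed history chunk.
        start = max(0, (chunk_idx - left_chunks) * chunk_size)
        # One past the last visible frame: end of the current chunk plus right context.
        end = min(num_frames, (chunk_idx + 1) * chunk_size + right_frames)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


# Without right context, frame 0 cannot see anything past its own chunk.
no_future = chunk_attention_mask(num_frames=8, chunk_size=4, left_chunks=1)
# With two frames of right context, frame 0 also sees frames 4 and 5.
with_future = chunk_attention_mask(num_frames=8, chunk_size=4, left_chunks=1, right_frames=2)
```

One plausible way to vary the right context during training would be to draw `right_frames` at random for each batch, so that a single model tolerates several look-ahead settings; whether this matches the paper's DRC masking is an assumption.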
