Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture
Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

TL;DR
This paper introduces an online hybrid CTC/attention end-to-end speech recognition architecture that replaces every offline component with a streaming counterpart, enabling real-time ASR with only moderate degradation in recognition accuracy.
Contribution
It presents novel streaming components for CTC and attention mechanisms, including sMoChA, MTA, T-CTC, and DWJD, to enable fully online speech recognition.
Findings
Improved real-time factor in human-computer interaction
Recognition performance largely maintained, with only moderate degradation
First full-stack online CTC/attention ASR solution
Abstract
Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch…
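For readers unfamiliar with the hybrid approach, the following is a minimal sketch of how a CTC/attention system is commonly described as ranking hypotheses: each hypothesis is scored by a weighted sum of its CTC and attention log-probabilities. This is not the authors' streaming decoder. The function names, the `ctc_weight` parameter, and the toy probabilities are all assumptions made for illustration.

```python
import math


def joint_score(log_p_ctc, log_p_att, ctc_weight=0.5):
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    ctc_weight is a hypothetical interpolation weight in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def rank_hypotheses(hyps, ctc_weight=0.5):
    """Sort (text, log_p_ctc, log_p_att) tuples by joint score, best first."""
    return sorted(
        hyps,
        key=lambda h: joint_score(h[1], h[2], ctc_weight),
        reverse=True,
    )


# Toy hypotheses with made-up probabilities (illustrative only).
hyps = [
    ("hello world", math.log(0.6), math.log(0.2)),
    ("hello word", math.log(0.3), math.log(0.5)),
]

ranked = rank_hypotheses(hyps, ctc_weight=0.5)
for text, lp_ctc, lp_att in ranked:
    print(text, round(joint_score(lp_ctc, lp_att, 0.5), 4))
```

Setting `ctc_weight` to 1.0 ranks by the CTC score alone and 0.0 ranks by the attention score alone. An online system would additionally need both scores to be computable from partial input, which is the problem the streaming components in this paper address.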
