Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Haoran Miao; Gaofeng Cheng; Pengyuan Zhang; Yonghong Yan

arXiv:2307.02351 · eess.AS · July 6, 2023 · IEEE/ACM Trans. Audio Speech Lang. Process.

TL;DR

This paper introduces an online hybrid CTC/attention end-to-end speech recognition architecture that replaces every offline component with a streaming counterpart, enabling real-time ASR with only moderate degradation in recognition performance.

Contribution

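It presents streaming components for both the attention and CTC branches, namely stable monotonic chunk-wise attention (sMoChA), monotonic truncated attention (MTA), truncated CTC (T-CTC), and dynamic waiting joint decoding (DWJD), to enable fully online speech recognition.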

Findings

1. Improved real-time factor in human-computer interaction
2. Maintained recognition performance with moderate degradation
3. First full-stack online CTC/attention ASR solution

Abstract

Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch…
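
As a rough illustration of what "hybrid CTC/attention" means at decoding time, the sketch below ranks hypotheses by a weighted sum of their CTC and attention log-probabilities, the usual formulation for this family of models. It is not taken from the paper: the names `joint_score`, `rank_hypotheses`, and `ctc_weight`, as well as the example probabilities, are assumptions made for illustration, and the sketch does not implement the streaming components (sMoChA, MTA) described above.

```python
import math


def joint_score(log_p_ctc: float, log_p_att: float, ctc_weight: float = 0.5) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    ctc_weight is a hypothetical tuning parameter in [0, 1]:
    0 uses the attention score only, 1 uses the CTC score only.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def rank_hypotheses(hyps, ctc_weight: float = 0.5):
    """Sort (text, log_p_ctc, log_p_att) tuples from best to worst joint score."""
    return sorted(
        hyps,
        key=lambda h: joint_score(h[1], h[2], ctc_weight),
        reverse=True,
    )


# Made-up scores for two competing transcriptions of the same utterance.
hyps = [
    ("hello world", math.log(0.6), math.log(0.3)),
    ("hollow world", math.log(0.2), math.log(0.5)),
]

best_joint = rank_hypotheses(hyps, ctc_weight=0.5)[0][0]
best_attention_only = rank_hypotheses(hyps, ctc_weight=0.0)[0][0]
```

With these made-up numbers the two branches disagree: the attention score alone prefers the second hypothesis, while the interpolated score prefers the first. An online system additionally has to compute both scores from partial input, which is the problem the streaming components in the abstract address.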
