Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

Hirofumi Inaguma; Tatsuya Kawahara

arXiv:2103.00422 · eess.AS · August 24, 2021

TL;DR

This paper introduces CTC synchronous training (CTC-ST), an end-to-end method that uses CTC alignments as a reference for token boundaries in online streaming attention-based speech recognition, reducing recognition errors and emission latency while improving robustness to long-form and noisy speech.

Contribution

It proposes a new alignment knowledge distillation approach using CTC to guide MoChA models, improving accuracy and latency without external alignment dependencies.

Findings

1. Significantly reduces recognition errors and emission latency.
2. Enhances robustness to long-form and noisy speech.
3. Achieves accuracy comparable to RNN-T with lower latency.

Abstract

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end…
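
To make the idea of using CTC alignments as a reference for token boundaries more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the truncated abstract above does not give the loss, so the choice of boundary (last frame of each non-blank run), the use of the attention-weighted expected frame index as the MoChA boundary, and the mean absolute difference as the penalty are all assumptions. The function names `ctc_boundaries`, `expected_boundaries`, and `sync_loss` are hypothetical.

```python
from typing import List


def ctc_boundaries(frame_labels: List[int], blank: int = 0) -> List[int]:
    """Token boundaries from a frame-level CTC alignment.

    Assumption: the boundary of a token is the LAST frame of its run of
    identical non-blank labels. Runs separated by a blank are separate tokens.
    """
    boundaries: List[int] = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank:
            if label != prev:
                boundaries.append(t)   # a new token starts at frame t
            else:
                boundaries[-1] = t     # same token continues, move boundary
        prev = label
    return boundaries


def expected_boundaries(alpha: List[List[float]]) -> List[float]:
    """Expected frame index per output token.

    alpha[i][j] is assumed to be the attention probability of output
    token i over encoder frame j (each row sums to about 1).
    """
    return [sum(j * a for j, a in enumerate(row)) for row in alpha]


def sync_loss(ctc_b: List[int], attn_b: List[float]) -> float:
    """Mean absolute distance between the two boundary sequences (assumed form)."""
    if len(ctc_b) != len(attn_b):
        raise ValueError("both alignments must cover the same number of tokens")
    return sum(abs(c - m) for c, m in zip(ctc_b, attn_b)) / len(ctc_b)


# Toy example: 8 encoder frames, 3 output tokens, blank label is 0.
frames = [0, 3, 3, 0, 5, 0, 5, 5]
alpha = [
    [0, 0, 1.0, 0, 0, 0, 0, 0],    # token 1 attends to frame 2
    [0, 0, 0, 0, 0.5, 0, 0.5, 0],  # token 2 is spread over frames 4 and 6
    [0, 0, 0, 0, 0, 0, 0, 1.0],    # token 3 attends to frame 7
]
ref = ctc_boundaries(frames)
hyp = expected_boundaries(alpha)
loss = sync_loss(ref, hyp)
```

In this toy case only the second token disagrees with the CTC reference, so it alone contributes to the penalty. In a real training setup such a term would be computed on differentiable tensors and added to the usual sequence-level objective with some weight; that weighting is likewise not specified in the text above.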

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing