Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
Hirofumi Inaguma, Tatsuya Kawahara

TL;DR
This paper introduces CTC synchronous training, a novel end-to-end method that leverages CTC alignments to improve online streaming attention-based speech recognition, reducing errors and latency while enhancing robustness.
Contribution
It proposes an alignment knowledge distillation approach in which CTC alignments guide MoChA models, improving accuracy and reducing token emission latency without relying on external alignments.
Findings
Significantly reduces recognition errors and emission latency.
Enhances robustness to long-form and noisy speech.
Achieves accuracy comparable to RNN-T with lower latency.
Abstract
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end…
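The idea of using CTC alignments as a reference for token boundaries can be illustrated with a small sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the function names (`ctc_token_boundaries`, `expected_boundaries`, `sync_loss`), the blank index of 0, the use of a greedy CTC path as a stand-in for a proper alignment against the reference transcript, the choice of the last frame of each token run as its boundary, and the mean absolute difference as the distance. Here `alpha` stands for a MoChA-style matrix of per-token selection probabilities over encoder frames.

```python
import numpy as np

BLANK = 0  # assumed blank index; not specified in the text above


def ctc_token_boundaries(log_probs, blank=BLANK):
    """Collapse a greedy CTC path into tokens and one boundary frame per token.

    log_probs: array of shape (num_frames, vocab_size).
    The boundary is taken as the last frame of each non-blank run
    (an illustrative choice, not necessarily the paper's definition).
    """
    path = np.argmax(log_probs, axis=1)
    tokens, boundaries = [], []
    prev = blank
    for t, label in enumerate(path):
        if label != blank:
            if label != prev:
                # A new token starts at this frame.
                tokens.append(int(label))
                boundaries.append(t)
            else:
                # Same token continues; push its boundary forward.
                boundaries[-1] = t
        prev = label
    return tokens, boundaries


def expected_boundaries(alpha):
    """Expected frame index per token under selection probabilities alpha.

    alpha: array of shape (num_tokens, num_frames), rows summing to at most 1.
    """
    frames = np.arange(alpha.shape[1])
    return alpha @ frames


def sync_loss(alpha, ctc_boundaries):
    """Mean absolute gap between attention-side and CTC-side boundaries."""
    target = np.asarray(ctc_boundaries, dtype=float)
    return float(np.mean(np.abs(expected_boundaries(alpha) - target)))


# Toy example: 6 frames, vocabulary {blank, 1, 2}.
toy_path = [0, 1, 1, 0, 2, 0]
toy_probs = np.full((6, 3), 0.05)
toy_probs[np.arange(6), toy_path] = 0.9
toy_tokens, toy_bounds = ctc_token_boundaries(np.log(toy_probs))

# Attention that fires one frame late on the first token.
toy_alpha = np.zeros((2, 6))
toy_alpha[0, 3] = 1.0
toy_alpha[1, 4] = 1.0
toy_loss = sync_loss(toy_alpha, toy_bounds)
```

In this toy setup the loss penalizes the first token for being attended one frame after its CTC boundary. In actual training, the number of CTC boundaries would need to match the number of decoder tokens, which a greedy path does not guarantee; an alignment constrained to the reference transcript would avoid that mismatch.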
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
