Improved Mask-CTC for Non-Autoregressive End-to-End ASR
Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

TL;DR
This paper enhances the Mask-CTC non-autoregressive ASR system by adopting the Conformer encoder architecture and adding auxiliary training objectives. The changes significantly improve accuracy while preserving fast inference, and the approach also shows potential for speech translation.
Contribution
The paper introduces architectural and training improvements to Mask-CTC, achieving higher accuracy without sacrificing inference speed, and explores its application to speech translation.
Findings
WER reduced from 15.5% to 9.1% on WSJ
Achieves results competitive with autoregressive models
Maintains fast inference, with a real-time factor (RTF) below 0.1
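The two figures reported above can be unpacked with simple arithmetic. The sketch below computes the relative WER reduction implied by the WSJ numbers and shows how RTF is conventionally defined (decoding time divided by audio duration). The function names and the timing values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    return processing_seconds / audio_seconds


# WER figures listed in the findings above (WSJ).
wer_drop = relative_reduction(15.5, 9.1)
print(f"relative WER reduction: {wer_drop:.1%}")  # 41.3%

# Hypothetical timing (an assumption, not from the paper): 0.8 s of compute
# for a 10 s utterance gives an RTF below the 0.1 mark.
rtf = real_time_factor(processing_seconds=0.8, audio_seconds=10.0)
print(f"RTF: {rtf:.2f}")  # 0.08
```

An RTF below 1.0 means decoding runs faster than the audio plays; below 0.1 means at least ten times faster.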
Abstract
For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference.…
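To make the "mask-predict with CTC" idea concrete, here is a minimal sketch of the first stage of such a decoder: collapse greedy CTC frame labels into tokens, then replace low-confidence tokens with a mask for a second pass to fill in. This is an illustration under stated assumptions, not the authors' implementation: the blank index, the max-over-frames confidence rule, the threshold value, and all function names are assumptions, and the model that predicts the masked tokens is omitted entirely.

```python
from typing import List, Optional, Tuple

BLANK = 0    # assumed index of the CTC blank symbol
MASK = None  # placeholder standing in for a masked position


def ctc_greedy_collapse(
    frame_ids: List[int], frame_probs: List[float]
) -> Tuple[List[int], List[float]]:
    """Merge consecutive repeated labels and drop blanks.

    Each surviving token keeps the highest probability among its frames
    (an assumed confidence rule for this sketch).
    """
    tokens: List[int] = []
    confs: List[float] = []
    prev = BLANK
    for tok, prob in zip(frame_ids, frame_probs):
        if tok != BLANK:
            if tok == prev:
                # Same token continues: keep the best frame probability.
                confs[-1] = max(confs[-1], prob)
            else:
                tokens.append(tok)
                confs.append(prob)
        prev = tok
    return tokens, confs


def mask_low_confidence(
    tokens: List[int], confs: List[float], threshold: float
) -> List[Optional[int]]:
    """Keep confident tokens; replace the rest with MASK."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confs)]


# Toy per-frame argmax labels and their probabilities (made-up values).
frame_ids = [0, 5, 5, 0, 7, 7, 7, 0, 5]
frame_probs = [0.9, 0.6, 0.95, 0.9, 0.4, 0.5, 0.45, 0.8, 0.99]

tokens, confs = ctc_greedy_collapse(frame_ids, frame_probs)
masked = mask_low_confidence(tokens, confs, threshold=0.9)
print(tokens)  # [5, 7, 5]
print(masked)  # [5, None, 5]
```

In this sketch the number of output positions is fixed by the CTC stage, so the second pass can only substitute tokens. The length-prediction objective described in the abstract is what lets the model delete or insert tokens instead.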
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
