Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

Yosuke Higuchi; Shinji Watanabe; Nanxin Chen; Tetsuji Ogawa; Tetsunori Kobayashi

arXiv:2005.08700 · eess.AS · August 18, 2020 · 6 cites

TL;DR

Mask CTC introduces a non-autoregressive speech recognition model that refines CTC outputs through mask prediction, significantly reducing inference time while maintaining high accuracy.

Contribution

The paper proposes Mask CTC, a novel non-autoregressive end-to-end ASR framework that combines CTC with mask prediction for faster inference and competitive accuracy.

Findings

01

Outperforms standard CTC, lowering word error rate (WER), e.g., from 17.9% to 12.1% on WSJ.

02

Approaches autoregressive accuracy with much faster inference (real-time factor, RTF, of 0.07).

03

Effective on multiple speech recognition tasks.
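
To make the two headline numbers concrete, here is a small arithmetic sketch. The helper names and the 10-second utterance are hypothetical, chosen only for illustration; the WER and RTF values are the ones quoted in the findings.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline word error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio (lower is faster)."""
    return processing_seconds / audio_seconds


# WER figures quoted above: standard CTC 17.9%, Mask CTC 12.1%.
reduction = relative_wer_reduction(17.9, 12.1)
print(f"relative WER reduction: {reduction:.1%}")  # 32.4%

# Hypothetical 10-second utterance decoded at the quoted RTF of 0.07.
audio_seconds = 10.0
decode_seconds = 0.07 * audio_seconds
print(f"decode time: {decode_seconds:.2f} s")  # 0.70 s
```

An RTF below 1.0 means decoding finishes faster than the audio plays, so 0.07 corresponds to roughly 0.7 seconds of compute for 10 seconds of speech.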

Abstract

We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining the outputs of connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. Non-autoregressive models, on the other hand, can generate tokens simultaneously within a constant number of iterations, which significantly reduces inference time and makes end-to-end ASR models better suited to real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are…
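
The initialization step described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the blank symbol, the mask placeholder, the 0.9 threshold, the choice of taking the maximum frame probability as a token's confidence, and the example frame posteriors are all assumptions. Because the abstract is cut off at this point, the sketch stops at masking and leaves out the subsequent mask-prediction step, which would need the trained decoder.

```python
from itertools import groupby

BLANK = "_"      # hypothetical CTC blank symbol
MASK = "<mask>"  # hypothetical mask placeholder


def greedy_ctc_with_confidence(frame_probs):
    """Greedy CTC decoding that also returns a confidence per token.

    frame_probs: list of dicts (one per frame) mapping symbol -> probability.
    Returns a list of (symbol, confidence) pairs.
    """
    # Pick the most probable symbol in every frame.
    best = [max(p.items(), key=lambda kv: kv[1]) for p in frame_probs]
    tokens = []
    # Collapse consecutive repeats, then drop blanks (standard CTC rule).
    for symbol, run in groupby(best, key=lambda kv: kv[0]):
        if symbol == BLANK:
            continue
        # Assumption: a token's confidence is the max over its merged frames.
        confidence = max(prob for _, prob in run)
        tokens.append((symbol, confidence))
    return tokens


def mask_low_confidence(tokens, threshold=0.9):
    """Keep confident tokens; replace the rest with the mask placeholder."""
    return [sym if conf >= threshold else MASK for sym, conf in tokens]


# Toy frame-level posteriors for illustration only.
frames = [
    {"c": 0.95, "a": 0.03, BLANK: 0.02},
    {"c": 0.90, BLANK: 0.10},
    {BLANK: 0.99, "a": 0.01},
    {"a": 0.55, "o": 0.40, BLANK: 0.05},
    {BLANK: 0.97, "t": 0.03},
    {"t": 0.98, BLANK: 0.02},
]

initial = mask_low_confidence(greedy_ctc_with_confidence(frames))
print(initial)  # ['c', '<mask>', 't']
```

In this toy input the middle token is uncertain (0.55 for "a" against 0.40 for "o"), so it is the only position handed over for refinement, while the confident tokens stay fixed as context.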

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing