Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions
Jumon Nozaki, Tatsuya Komatsu

TL;DR
This paper relaxes the conditional independence assumption of CTC-based ASR by conditioning the final prediction on intermediate predictions. The method substantially reduces WER relative to a standard CTC model while decoding much faster than autoregressive models.
Contribution
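The paper trains a CTC model with auxiliary CTC losses at intermediate layers and adds each intermediate prediction to the input of the next layer, so that the final prediction is conditioned on those intermediate predictions. This improves accuracy while keeping the simplicity and speed of CTC models.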
Findings
Over 20% relative WER reduction over a standard CTC model on the WSJ corpus
Performance comparable to autoregressive models
Decoding at least 30 times faster than autoregressive models
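The speed advantage follows from how a CTC model is decoded: every frame is predicted in a single forward pass, and the output sequence is read off without a token-by-token loop through the network. A minimal sketch of greedy CTC decoding is shown here, assuming a frame-by-vocabulary posterior matrix and blank index 0. The function name `ctc_greedy_decode` and the choice of greedy (rather than beam) decoding are assumptions made for illustration, not details confirmed by this summary.

```python
import numpy as np

def ctc_greedy_decode(posteriors, blank=0):
    """Collapse a (frames, vocab) posterior matrix into a token sequence.

    Hypothetical helper: takes the argmax token per frame, merges
    consecutive repeats, then drops blanks.
    """
    best = posteriors.argmax(axis=-1)  # most likely token for each frame
    tokens = []
    prev = None
    for t in best:
        t = int(t)
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if t != prev and t != blank:
            tokens.append(t)
        prev = t
    return tokens

# Toy input: one-hot posteriors for the frame path 0 1 1 0 1 2 2 0.
path = [0, 1, 1, 0, 1, 2, 2, 0]
toy_posteriors = np.eye(3)[path]
decoded = ctc_greedy_decode(toy_posteriors)
```

The blank between the two runs of token 1 is what allows the same token to appear twice in a row in the decoded output.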
Abstract
This paper proposes a method to relax the conditional independence assumption of connectionist temporal classification (CTC)-based automatic speech recognition (ASR) models. We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer. During both training and inference, each prediction generated in the intermediate layers is added to the input of the next layer to condition the prediction of the last layer on those intermediate predictions. Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed. We conduct experiments on three different ASR corpora. Our proposed method improves a standard CTC model significantly (e.g., more than 20% relative word error rate reduction on the WSJ corpus) with little computational overhead. Moreover, for the…
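The conditioning step described in the abstract can be sketched roughly as follows. This is a toy NumPy forward pass, not the authors' implementation: the class name `SelfConditionedStack`, the `tanh` residual block standing in for a real encoder layer, the layer normalization before the output projection, and the sharing of one vocabulary projection between intermediate and final layers are all assumptions made for illustration. What it does take from the abstract is that intermediate predictions are computed at chosen layers and added to the input of the next layer, in the same way at training and inference time.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class SelfConditionedStack:
    """Toy encoder stack (hypothetical stand-in for a real ASR encoder)."""

    def __init__(self, num_layers, dim, vocab, tap_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [rng.normal(0, dim ** -0.5, (dim, dim))
                       for _ in range(num_layers)]
        # Assumed: one hidden-to-vocab projection shared by all CTC heads.
        self.to_vocab = rng.normal(0, dim ** -0.5, (dim, vocab))
        # Assumed: a vocab-to-hidden projection so predictions can be added.
        self.to_hidden = rng.normal(0, vocab ** -0.5, (vocab, dim))
        self.tap_layers = set(tap_layers)

    def forward(self, x):
        intermediate = []
        num_layers = len(self.layers)
        for i, w in enumerate(self.layers, start=1):
            x = x + np.tanh(x @ w)  # placeholder for a real encoder block
            if i in self.tap_layers and i < num_layers:
                # Intermediate prediction: per-frame token posteriors.
                post = softmax(layer_norm(x) @ self.to_vocab)
                intermediate.append(post)
                # Condition later layers by adding the prediction,
                # projected back to the hidden size, to the next input.
                x = x + post @ self.to_hidden
        final = softmax(layer_norm(x) @ self.to_vocab)
        return final, intermediate

model = SelfConditionedStack(num_layers=6, dim=16, vocab=10, tap_layers=[2, 4])
frames = np.random.default_rng(1).normal(size=(50, 16))  # 50 frames, 16 features
final, intermediate = model.forward(frames)
```

During training, a CTC loss would be computed on `final` and on each entry of `intermediate` and the losses combined. How they are weighted is not stated in this summary, so it is left out of the sketch.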
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Connectionist Temporal Classification Loss
