Blank Collapse: Compressing CTC emission for the faster decoding
Minkyu Jung, Ohhyeok Kwon, Seunghyun Seo, Soonshin Seo

TL;DR
This paper introduces a simple method to compress CTC emission calculations, significantly accelerating beam search decoding in speech recognition with minimal accuracy loss, validated through experiments and theoretical analysis.
Contribution
The paper proposes a novel, straightforward approach to reduce CTC emission calculations, enabling up to 78% faster decoding in speech recognition tasks.
Findings
Up to 78% faster decoding than ordinary beam search decoding on the LibriSpeech datasets
Very small loss of accuracy with the proposed method
The reduction is more pronounced when the model's accuracy is higher
Abstract
The Connectionist Temporal Classification (CTC) model is a very efficient method for modeling sequences, especially speech data. To use a CTC model for Automatic Speech Recognition (ASR), beam search decoding with an external language model, such as an n-gram LM, is necessary to obtain reasonable results. In this paper, we analyze the blank label in CTC beam search in depth and propose a very simple method that reduces the amount of computation, resulting in faster beam search decoding. With this method, we obtain up to 78% faster decoding than ordinary beam search decoding, with a very small loss of accuracy on the LibriSpeech datasets. We show that this method is effective not only in practice, through experiments, but also in theory, through mathematical reasoning. We also observe that the reduction is more pronounced when the accuracy of the model is higher.
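The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "collapsing" blank frames before beam search. It assumes that a frame counts as blank when its blank probability reaches a threshold, and that only the first frame of each run of consecutive blank frames is kept so that one blank still separates repeated labels. The function name `collapse_blank_frames`, the `threshold` parameter, and its default value are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def collapse_blank_frames(
    emission: Sequence[Sequence[float]],
    blank_id: int = 0,
    threshold: float = 0.999,  # hypothetical value, not from the paper
) -> List[int]:
    """Return the indices of the frames kept after collapsing blank runs.

    Assumption: a frame is "blank" when its blank probability is at
    least `threshold`. Within each run of consecutive blank frames only
    the first frame is kept; every non-blank frame is kept unchanged.
    """
    kept: List[int] = []
    in_blank_run = False
    for t, frame in enumerate(emission):
        is_blank = frame[blank_id] >= threshold
        if is_blank and in_blank_run:
            continue  # interior of a blank run: drop this frame
        kept.append(t)
        in_blank_run = is_blank
    return kept


# Toy emission: 6 frames over the vocabulary [blank, "a", "b"].
emission = [
    [0.9995, 0.0003, 0.0002],  # blank, kept (starts a run)
    [0.9999, 0.0001, 0.0000],  # blank, dropped
    [0.0100, 0.9800, 0.0100],  # "a", kept
    [0.9996, 0.0002, 0.0002],  # blank, kept (starts a run)
    [0.9997, 0.0002, 0.0001],  # blank, dropped
    [0.0200, 0.0100, 0.9700],  # "b", kept
]
kept = collapse_blank_frames(emission)
compressed = [emission[t] for t in kept]
```

In this toy case the beam search would process the four frames in `compressed` instead of the original six. The saving depends on how many frames the model marks as confidently blank, which is in line with the abstract's observation that the reduction grows with model accuracy.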
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
