Blank Collapse: Compressing CTC emission for the faster decoding

Minkyu Jung; Ohhyeok Kwon; Seunghyun Seo; Soonshin Seo

arXiv:2210.17017·cs.CL·June 28, 2023

TL;DR

This paper introduces a simple method that compresses the CTC emission to speed up beam search decoding in speech recognition by up to 78% with a very small loss of accuracy, supported by both experiments and theoretical analysis.

Contribution

The paper proposes a simple method that compresses the CTC emission to reduce the computation in beam search decoding, giving up to 78% faster decoding than ordinary beam search on the LibriSpeech datasets.

Findings

01 Up to 78% faster decoding than ordinary beam search decoding

02 Very small loss of accuracy on the LibriSpeech datasets

03 The speed-up is more pronounced when the model's accuracy is higher

Abstract

The Connectionist Temporal Classification (CTC) model is a very efficient method for modeling sequences, especially speech data. To use a CTC model for Automatic Speech Recognition (ASR), beam search decoding with an external language model, such as an n-gram LM, is necessary to obtain reasonable results. In this paper, we analyze the blank label in CTC beam search in depth and propose a very simple method that reduces the amount of computation, resulting in faster beam search decoding. With this method, we obtain up to 78% faster decoding than ordinary beam search decoding with a very small loss of accuracy on the LibriSpeech datasets. We show that the method is effective not only in practice, through experiments, but also in theory, through mathematical reasoning. We also observe that the reduction is more pronounced when the accuracy of the model is higher.
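
The abstract does not spell out the procedure, so the following is only a minimal sketch of one way to read "compressing the CTC emission". It assumes that runs of consecutive frames whose blank probability exceeds a threshold can be collapsed into a single frame, since in CTC a run of blanks between two labels plays the same role as one blank. The function names, the `threshold` parameter, and the choice to keep the first frame of each run are illustrative assumptions, not taken from the paper or its repository. A greedy decoder stands in for beam search here purely to keep the example short.

```python
from typing import List, Sequence


def collapse_blank_frames(
    emission: Sequence[Sequence[float]],
    blank_id: int = 0,
    threshold: float = 0.99,
) -> List[List[float]]:
    """Keep only the first frame of each run of blank-dominated frames.

    emission: T x V per-frame probabilities (not log-probabilities).
    A frame is treated as blank-dominated when its blank probability is
    strictly greater than `threshold` (an assumed criterion).
    """
    kept: List[List[float]] = []
    in_blank_run = False
    for frame in emission:
        is_blank = frame[blank_id] > threshold
        if is_blank and in_blank_run:
            continue  # a later frame of the same blank run: drop it
        kept.append(list(frame))
        in_blank_run = is_blank
    return kept


def greedy_ctc_decode(
    emission: Sequence[Sequence[float]], blank_id: int = 0
) -> List[int]:
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    decoded: List[int] = []
    prev = None
    for frame in emission:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != prev and best != blank_id:
            decoded.append(best)
        prev = best
    return decoded


# Toy emission over the vocabulary {0: blank, 1: "a", 2: "b"}.
BLANK = [0.995, 0.003, 0.002]
A = [0.005, 0.990, 0.005]
B = [0.005, 0.005, 0.990]
emission = [A, BLANK, BLANK, BLANK, A, B, BLANK, BLANK]

collapsed = collapse_blank_frames(emission, blank_id=0, threshold=0.99)
print(len(emission), "->", len(collapsed), "frames")
print(greedy_ctc_decode(emission), greedy_ctc_decode(collapsed))
```

In this toy case the decoder sees fewer frames, and the single blank kept between the two "a" frames still separates the repeated label, so the decoded sequence is unchanged. Whether accuracy is preserved under beam search with a language model depends on the threshold and is what the paper evaluates.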

Code & Models

Repositories

minkjung/blankcollapse (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing