Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference
Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

TL;DR
This paper introduces a segment-level vectorized beam search method for speech recognition that substantially speeds up inference while keeping accuracy comparable to autoregressive decoding, by combining greedy CTC decoding with parallel re-prediction of low-confidence segments.
Contribution
The paper proposes a partially autoregressive framework that uses segment-level vectorized beam search to speed up inference in speech recognition models.
Findings
12 to 13 times faster inference on LibriSpeech
Maintains high accuracy comparable to traditional AR decoding
Low-confidence tokens in the greedy CTC hypothesis are re-predicted in parallel at the segment level
Abstract
Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference, primarily because the decoder computes its outputs incrementally. This work proposes a partially AR framework, which employs segment-level vectorized beam search to improve the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. The method first generates an initial hypothesis using greedy CTC decoding and identifies low-confidence tokens based on their output probabilities. The decoder then performs segment-level vectorized beam search on these tokens, re-predicting them in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13…
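The first stage described in the abstract (greedy CTC decoding, then flagging low-confidence tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. The blank index, the 0.9 threshold, the use of the highest frame probability as a token's confidence, and the grouping of adjacent low-confidence tokens into one segment are all assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_with_confidence(frame_probs, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    Returns a list of (token_id, confidence) pairs. Confidence is taken as
    the highest frame probability within the token's run of frames, which
    is an assumption and not necessarily the paper's definition.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    scores = [max(p) for p in frame_probs]
    tokens = []
    t = 0
    for token_id, run in groupby(best):
        n = len(list(run))
        if token_id != blank:
            tokens.append((token_id, max(scores[t:t + n])))
        t += n
    return tokens


def low_confidence_segments(tokens, threshold=0.9):
    """Group consecutive low-confidence token positions into spans.

    Each span is (start, end) with end exclusive. The threshold value and
    the grouping rule are illustrative assumptions.
    """
    segments = []
    start = None
    for i, (_, conf) in enumerate(tokens):
        if conf < threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(tokens)))
    return segments


# Toy posteriors: 7 frames over a 4-symbol vocabulary (index 0 is blank).
frame_probs = [
    [0.05, 0.95, 0.00, 0.00],
    [0.05, 0.95, 0.00, 0.00],
    [0.90, 0.05, 0.05, 0.00],
    [0.10, 0.10, 0.60, 0.20],
    [0.20, 0.10, 0.20, 0.50],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.98, 0.005, 0.005],
]
tokens = greedy_ctc_with_confidence(frame_probs)
segments = low_confidence_segments(tokens)
```

In this toy input the hypothesis has four tokens, and the two middle tokens fall below the threshold and form a single span. In the framework described above, only such spans would be passed to the decoder for parallel re-prediction, while the confident tokens are kept from the CTC hypothesis.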
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
