Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

Masao Someki; Nicholas Eng; Yosuke Higuchi; Shinji Watanabe

arXiv:2309.14922 · eess.AS · February 13, 2024

TL;DR

This paper introduces a segment-level vectorized beam search method for speech recognition that speeds up inference while keeping accuracy close to that of autoregressive decoding. It combines greedy CTC decoding with parallel re-prediction of low-confidence segments.

Contribution

The paper proposes a partially autoregressive framework that uses segment-level vectorized beam search to speed up inference in hybrid CTC/attention speech recognition models.

Findings

01. 12 to 13 times faster inference on LibriSpeech

02. Maintains high accuracy comparable to traditional AR decoding

03. Effective segmentation and parallel re-prediction approach

Abstract

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search for improving the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. It first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13…
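The first stage described in the abstract (greedy CTC decoding followed by selecting low-confidence tokens) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the blank index `0`, the `0.9` threshold, and the choice to score a collapsed token by the highest probability among its repeated frames are all assumptions made for illustration. The decoder-side segment-level beam search is not shown.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc(frame_probs: List[List[float]], blank: int = BLANK) -> List[Tuple[int, float]]:
    """Frame-wise argmax, then collapse repeats and drop blanks.

    Returns (token, confidence) pairs. The confidence of a collapsed token
    is the highest frame probability among its repeats (an assumption).
    """
    tokens: List[Tuple[int, float]] = []
    prev = blank
    for probs in frame_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        p = probs[best]
        if best == blank:
            prev = blank
            continue
        if best == prev:
            # Same token on consecutive frames: merge and keep the best score.
            tok, conf = tokens[-1]
            tokens[-1] = (tok, max(conf, p))
        else:
            tokens.append((best, p))
        prev = best
    return tokens


def low_confidence_segments(
    tokens: List[Tuple[int, float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Group consecutive low-confidence tokens into [start, end) index spans."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, (_, conf) in enumerate(tokens):
        if conf < threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(tokens)))
    return segments


if __name__ == "__main__":
    # Toy posteriors over a 4-symbol vocabulary (index 0 is blank).
    frames = [
        [0.10, 0.85, 0.03, 0.02],
        [0.05, 0.95, 0.00, 0.00],
        [0.90, 0.05, 0.03, 0.02],
        [0.20, 0.10, 0.60, 0.10],
        [0.10, 0.10, 0.30, 0.50],
        [0.97, 0.01, 0.01, 0.01],
        [0.02, 0.96, 0.01, 0.01],
    ]
    hyp = greedy_ctc(frames)
    print(hyp)
    print(low_confidence_segments(hyp))
```

In this sketch, each returned span marks a run of tokens that a decoder would re-predict. Because the spans are independent of one another, they could in principle be batched and processed together, which is the property the abstract relies on for parallel re-prediction.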

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
