Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR
Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

TL;DR
This paper introduces a novel blockwise synchronous decoding algorithm for streaming encoder-decoder ASR systems that reduces latency and computational costs by combining endpoint prediction and post-determination techniques.
Contribution
It proposes the run-and-back stitch search, a hybrid decoding method that improves streaming ASR performance by effectively synchronizing blocks and handling mispredictions.
Findings
Latency reduced from 1487 ms to 821 ms at the 90th percentile on LibriSpeech.
Maintains high recognition accuracy despite latency improvements.
Reduces computational cost in streaming ASR decoding.
Abstract
Streaming-style inference for encoder-decoder automatic speech recognition (ASR) systems is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the number of tokens that are yet to be emitted in the encoder features of the current blocks using the CTC posterior. Based on the expectation value, the decoder predicts the endpoint to realize continuous block synchronization, as a running stitch. Meanwhile, endpoint post-determination probabilistically detects backward jumps of the source-target attention, which are caused by the misprediction of endpoints. It then resumes decoding by discarding those hypotheses, as a back stitch. We combine these…
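The two quantities named in the abstract can be illustrated with a small sketch. This is not the authors' formulation: it assumes that frames are sampled independently from their CTC posteriors, that the blank symbol sits at index 0, and that attention rows can be treated as independent distributions over frames. The function names and the threshold in `should_advance_block` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def expected_remaining_tokens(
    posteriors: Sequence[Sequence[float]], blank: int = BLANK
) -> float:
    """Expected number of collapsed CTC tokens in a span of frames.

    Under the frame-independence assumption, a frame starts a new token
    when its label is non-blank and differs from the previous frame's
    label, so the expectation is a sum of per-frame start probabilities.
    """
    expected = 0.0
    prev: Optional[Sequence[float]] = None
    for frame in posteriors:
        for k, p in enumerate(frame):
            if k == blank:
                continue
            # Probability that the previous frame carried the same label.
            p_prev_same = prev[k] if prev is not None else 0.0
            expected += p * (1.0 - p_prev_same)
        prev = frame
    return expected


def should_advance_block(
    posteriors: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
    blank: int = BLANK,
) -> bool:
    """Hypothetical endpoint rule: advance once few tokens are expected."""
    return expected_remaining_tokens(posteriors, blank) < threshold


def backward_jump_probability(
    prev_attn: Sequence[float], cur_attn: Sequence[float]
) -> float:
    """P(current attention position < previous attention position).

    Both rows are treated as independent distributions over frame indices.
    """
    prob = 0.0
    tail = sum(prev_attn)
    for a_prev, a_cur in zip(prev_attn, cur_attn):
        tail -= a_prev  # mass of prev_attn strictly after this frame
        prob += a_cur * tail
    return prob
```

In a decoder loop, a low value from `expected_remaining_tokens` over the unread frames of the current block would suggest moving on to the next block (the running stitch), while a high `backward_jump_probability` for a hypothesis would suggest discarding it and resuming from an earlier point (the back stitch). How the paper combines and thresholds these signals is not covered by this sketch.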
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
