VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
Hirofumi Inaguma, Tatsuya Kawahara

TL;DR
This paper proposes decoding algorithms that let a streaming hybrid CTC/attention ASR model (MoChA with an auxiliary CTC objective) run on unsegmented long-form recordings without voice activity detection (VAD).
Contribution
The paper presents two decoding algorithms for streaming ASR on unsegmented long-form speech: a block-synchronous beam search, and a VAD-free inference algorithm that uses CTC probabilities to decide when to reset the model states.
Findings
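Block-synchronous decoding achieves accuracy comparable to label-synchronous decoding.
VAD-free inference robustly recognizes long-form speech for up to a few hours.
CTC probabilities are used to choose when to reset the model states, which addresses the model's vulnerability to long-form data.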
Abstract
In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
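The abstract says the VAD-free inference uses CTC probabilities to pick a suitable time to reset the model states, but it does not give the exact criterion. The sketch below illustrates one plausible rule only: trigger a reset once the CTC blank probability has stayed above a threshold for several consecutive frames. The function name `find_reset_points`, the parameters `threshold` and `min_run`, and the rule itself are assumptions made for illustration, not the authors' algorithm.

```python
from typing import List, Sequence


def find_reset_points(
    blank_probs: Sequence[float],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
    min_run: int = 3,        # hypothetical value, not taken from the paper
) -> List[int]:
    """Return frame indices at which a state reset would be triggered.

    Assumed rule (illustrative only): a reset fires once per run of
    blank-dominated frames, at the frame where the number of consecutive
    frames with blank probability >= threshold first reaches min_run.
    """
    resets: List[int] = []
    run = 0  # length of the current run of blank-dominated frames
    for t, p in enumerate(blank_probs):
        if p >= threshold:
            run += 1
            if run == min_run:
                resets.append(t)
        else:
            run = 0  # a non-blank frame interrupts the run
    return resets


# Per-frame CTC blank probabilities for a toy 12-frame stream.
probs = [0.1, 0.95, 0.97, 0.99, 0.98, 0.2, 0.95, 0.96, 0.1, 0.99, 0.99, 0.99]
reset_frames = find_reset_points(probs)
```

In a streaming decoder, each returned index would mark a point where the encoder and decoder states could be cleared before processing continues. A short run of blanks, such as frames 6 and 7 in the toy input, does not trigger a reset under this rule.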
