VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

Hirofumi Inaguma; Tatsuya Kawahara

arXiv:2107.07509 · eess.AS · July 16, 2021

TL;DR

This paper proposes decoding algorithms that let a streaming hybrid CTC/attention ASR model transcribe unsegmented long-form recordings without voice activity detection (VAD), targeting robustness to long inputs and efficient batched search.

Contribution

The paper presents a VAD-free inference algorithm and a block-synchronous beam search algorithm for streaming ASR on unsegmented long-form speech.

Findings

01

Block-synchronous decoding achieves accuracy comparable to label-synchronous decoding.

02

VAD-free inference robustly recognizes long-form speech for up to a few hours.

03

VAD-free inference uses CTC probabilities to decide when to reset the model states.

Abstract

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
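
The idea of using CTC probabilities to pick a reset timing can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes, purely for illustration, that a reset is proposed once the CTC blank posterior stays above a threshold for a fixed number of consecutive frames. The function name `find_reset_points` and the parameters `blank_threshold` and `min_blank_frames` are hypothetical and do not come from the paper.

```python
from typing import Iterable, List


def find_reset_points(
    blank_probs: Iterable[float],
    blank_threshold: float = 0.9,   # hypothetical value, not from the paper
    min_blank_frames: int = 5,      # hypothetical value, not from the paper
) -> List[int]:
    """Return frame indices at which model states could be reset.

    Assumed rule (illustrative only): propose a reset at the frame where
    the run of consecutive frames with CTC blank posterior of at least
    `blank_threshold` first reaches `min_blank_frames`.
    """
    reset_points: List[int] = []
    run = 0  # length of the current run of high-blank frames
    for t, p in enumerate(blank_probs):
        if p >= blank_threshold:
            run += 1
            # Fire once per run, when it first reaches the required length.
            if run == min_blank_frames:
                reset_points.append(t)
        else:
            run = 0
    return reset_points


# Per-frame CTC blank posteriors for a short toy stream.
probs = [0.1, 0.95, 0.95, 0.95, 0.2, 0.99, 0.99, 0.99, 0.99]
resets = find_reset_points(probs, blank_threshold=0.9, min_blank_frames=3)
```

In a streaming setting, each returned index would mark a point where the encoder and decoder states are cleared before processing continues, so that state accumulated over a long recording does not degrade later hypotheses. The rule fires only once per run of blank-dominated frames, which avoids repeated resets inside a single long silence.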
