VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition
Jinhan Wang, Xiaosu Tong, Jinxi Guo, Di He, Roland Maas

TL;DR
This paper introduces VADOI, a voice-activity-detection based overlapping inference method for end-to-end long-form speech recognition, balancing accuracy and computational efficiency.
Contribution
It proposes VADOI, an inference technique that uses voice activity detection to guide how speech segments overlap, reducing computation cost while maintaining WER performance.
Findings
Achieves a 20% reduction in computation cost on Librispeech and a translation corpus.
Maintains WER performance comparable to the best existing overlapping inference methods.
Introduces Soft-Match to address word misalignment issues.
Abstract
While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. Two previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases. Setting aside computational cost, the setup with 50% overlapping during inference achieves the best performance. However, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference with various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost. Results show that the proposed method can achieve a 20%…
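The link between overlapping percentage and computation cost can be illustrated with a rough sketch. It assumes fixed-length windows and measures cost simply as seconds of audio decoded relative to the length of the recording. The function names, the window length, and the cost measure are illustrative assumptions, not taken from the paper, and the VAD-based choice of overlaps that VADOI proposes is not modeled here.

```python
def segment_starts(total_sec, seg_sec, overlap):
    """Start times of fixed-length windows over a recording.

    `overlap` is the fraction of each window shared with the next one
    (0.5 means 50% overlapping). Hypothetical helper for illustration.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = seg_sec * (1.0 - overlap)  # distance between consecutive window starts
    starts = []
    t = 0.0
    while True:
        starts.append(t)
        if t + seg_sec >= total_sec:  # this window reaches the end of the audio
            break
        t += hop
    return starts


def relative_cost(total_sec, seg_sec, overlap):
    """Seconds of audio decoded divided by the recording length.

    A value of 1.0 means every second is decoded exactly once.
    """
    starts = segment_starts(total_sec, seg_sec, overlap)
    # The last window may be cut short by the end of the recording.
    decoded = sum(min(seg_sec, total_sec - s) for s in starts)
    return decoded / total_sec


if __name__ == "__main__":
    for overlap in (0.0, 0.25, 0.5):
        cost = relative_cost(60.0, 10.0, overlap)
        print(f"overlap={overlap:.2f}  relative cost={cost:.2f}")
```

Under these assumptions, a higher overlapping percentage means more windows and more audio decoded per second of input, which is the cost that the abstract weighs against WER.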
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
