E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

W. Ronny Huang; Shuo-yiin Chang; David Rybach; Rohit Prabhavalkar; Tara N. Sainath; Cyril Allauzen; Cal Peyser; Zhiyun Lu

arXiv:2204.10749 · cs.SD · June 16, 2022

TL;DR

This paper introduces an end-to-end ASR model that jointly segments and decodes long audio streams, improving accuracy and latency by integrating semantic understanding into segmentation decisions.

Contribution

The proposed model replaces traditional VAD with an end-to-end system that predicts segment boundaries using both acoustic and semantic features, enhancing long-form speech recognition.
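The difference between the two segmentation strategies can be illustrated with a toy sketch. It is not the paper's implementation: the `<eos>` token name, the `Word` type, the pause durations, and the 0.5 s threshold are all assumptions made up for illustration. The first function cuts on silence alone, as a VAD-style segmenter would; the second cuts only where the decoder output itself contains an end-of-segment marker.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical end-of-segment marker; the name is an assumption for this sketch.
EOS = "<eos>"


@dataclass
class Word:
    text: str
    pause_after_s: float  # silence following the word, in seconds (made-up values)


def segment_by_pause(words: List[Word], min_pause_s: float = 0.5) -> List[str]:
    """VAD-style baseline: cut wherever the silence after a word is long enough."""
    segments, current = [], []
    for w in words:
        current.append(w.text)
        if w.pause_after_s >= min_pause_s:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


def segment_by_token(tokens: List[str]) -> List[str]:
    """Joint-style: cut only where the decoded stream contains the EOS marker."""
    segments, current = [], []
    for t in tokens:
        if t == EOS:
            if current:
                segments.append(" ".join(current))
                current = []
        else:
            current.append(t)
    if current:
        segments.append(" ".join(current))
    return segments


# The hesitation after "for" is long enough to trip the pause threshold.
words = [
    Word("set", 0.05), Word("an", 0.05), Word("alarm", 0.05),
    Word("for", 0.9), Word("5", 0.05), Word("o'clock", 1.2),
]
vad_segments = segment_by_pause(words)

# A decoder that conditions on the text so far can hold the boundary until the sentence is complete.
tokens = ["set", "an", "alarm", "for", "5", "o'clock", EOS]
e2e_segments = segment_by_token(tokens)

print(vad_segments)
print(e2e_segments)
```

In this sketch the pause-based splitter breaks the command in two at the hesitation, while the token-based splitter keeps it whole, which is the behavior the abstract's "set an alarm for... 5 o'clock" example motivates.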

Findings

01. 8.5% relative WER improvement on long-form audio

02. 250 ms reduction in median end-of-segment latency

03. Effective on real-world YouTube audio up to 30 minutes long
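The WER figure above is a relative improvement, which is easy to confuse with an absolute one. A minimal sketch of the arithmetic follows; the baseline and new WER values are hypothetical, chosen only so the ratio comes out near 8.5%, and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped, measured relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not figures from the paper).
baseline, new = 20.0, 18.3
rel = relative_wer_improvement(baseline, new)
absolute = baseline - new  # drop in WER percentage points

print(f"relative: {rel:.1%}, absolute: {absolute:.1f} points")
```

With these made-up numbers, an 8.5% relative improvement corresponds to a drop of only 1.7 absolute WER points.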

Abstract

Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real world long-form…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing