E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

TL;DR
This paper introduces an end-to-end ASR model that jointly segments and decodes long audio streams, improving accuracy and reducing latency by conditioning segmentation decisions on the decoded text as well as on the acoustics.
Contribution
The proposed model replaces traditional VAD with an end-to-end system that predicts segment boundaries using both acoustic and semantic features, enhancing long-form speech recognition.
Findings
8.5% relative WER improvement on long-form audio
250 ms reduction in median end-of-segment latency
Effective on real-world YouTube audio up to 30 minutes long
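A relative WER improvement is a ratio against the baseline WER, not an absolute drop in WER points. The sketch below shows the arithmetic only; the function name and the two WER values are hypothetical, chosen to produce an 8.5% relative figure, and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 20.0% WER and a new system at 18.3% WER.
example = relative_wer_improvement(20.0, 18.3)
print(f"relative improvement: {example:.1%}")
```

Under these made-up values the absolute gain is 1.7 WER points, while the relative gain is 8.5%; the two should not be confused when comparing systems.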
Abstract
Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real world long-form…
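The contrast drawn in the abstract can be illustrated with a toy sketch. It is not the authors' model: the frame representation, the `<eos>` token name, the silence threshold, and both function names are assumptions made for illustration. One segmenter cuts purely on a run of non-speech frames (VAD-style); the other cuts only when the decoder itself emits an end-of-segment token.

```python
from typing import List, Optional, Tuple

# One frame: (is_speech, token emitted at this frame, or None).
Frame = Tuple[bool, Optional[str]]

EOS = "<eos>"  # hypothetical end-of-segment token name


def vad_segments(frames: List[Frame], max_silence: int = 3) -> List[str]:
    """Cut a segment once silence has lasted max_silence consecutive frames."""
    segments, words, silence = [], [], 0
    for is_speech, token in frames:
        if token is not None and token != EOS:
            words.append(token)
        silence = 0 if is_speech else silence + 1
        if silence == max_silence and words:
            segments.append(" ".join(words))
            words = []
    if words:
        segments.append(" ".join(words))
    return segments


def joint_segments(frames: List[Frame]) -> List[str]:
    """Cut a segment only when the decoder emits the end-of-segment token."""
    segments, words = [], []
    for _, token in frames:
        if token == EOS:
            if words:
                segments.append(" ".join(words))
                words = []
        elif token is not None:
            words.append(token)
    if words:
        segments.append(" ".join(words))
    return segments


# "set an alarm for... 5 o'clock" with a four-frame hesitation in the middle.
frames: List[Frame] = [
    (True, "set"), (True, "an"), (True, "alarm"), (True, "for"),
    (False, None), (False, None), (False, None), (False, None),
    (True, "5"), (True, "o'clock"),
    (False, EOS),
]

print(vad_segments(frames))
print(joint_segments(frames))
```

In this toy input the silence-based rule splits the sentence at the hesitation, while the token-based rule keeps it whole, which is the failure mode the abstract attributes to VAD segmenters.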
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
