Smart Speech Segmentation using Acousto-Linguistic Features with Look-ahead
Piyush Behre, Naveen Parihar, Sharman Tan, Amy Shah, Eva Sharma, Geoffrey Liu, Shuangyu Chang, Hosam Khalil, Chris Basoglu, Sayan Pathak

TL;DR
This paper introduces a hybrid speech segmentation method combining acoustic and linguistic features with look-ahead, significantly enhancing segmentation accuracy and downstream translation quality across multiple languages.
Contribution
It presents a novel hybrid segmentation approach that incorporates language understanding and look-ahead, outperforming traditional acoustic-only methods.
Findings
Segmentation-F0.5 score improved by 9.8% on average over the baseline
BLEU score for machine translation increased by 1.05 points
Effective across multiple languages
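The segmentation-F0.5 figure above can be read with the standard F-beta definition in mind, where beta = 0.5 weights precision more heavily than recall. The sketch below is illustrative only: it assumes segment boundaries are represented as sets of word indices, and the helper names `boundary_precision_recall` and `f_beta` are hypothetical, not taken from the paper.

```python
def boundary_precision_recall(predicted: set, reference: set) -> tuple:
    """Precision and recall of predicted boundaries against reference boundaries.

    Assumption: a boundary is identified by the index of the word it follows.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Toy example: three predicted boundaries, two of which match the reference.
p, r = boundary_precision_recall({4, 9, 15}, {4, 12, 15, 20})
score = f_beta(p, r, beta=0.5)
```

With beta = 0.5, a spurious boundary (a precision error) costs more than a missed one, which fits the concern that overly aggressive segmentation splits sentences in the middle.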
Abstract
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), which are both limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine translation for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word as a look-ahead boosts segmentation quality. On average, our models improve segmentation-F0.5 score by 9.8% over baseline. We…
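To make the contrast in the abstract concrete, here is a toy sketch of a silence-timeout segmenter next to a hybrid rule that mixes a pause score with a linguistic end-of-sentence score computed using one look-ahead word. Everything in it is an assumption for illustration: the `Word` type, the `toy_eos_score` heuristic, the weights, and the thresholds are hypothetical and do not reproduce the authors' models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Word:
    text: str
    pause_after: float  # seconds of silence following the word (hypothetical field)


def silence_timeout_segments(words: List[Word], timeout: float = 0.5) -> List[List[str]]:
    """Acoustic-only baseline: cut whenever the pause reaches the timeout."""
    segments, current = [], []
    for word in words:
        current.append(word.text)
        if word.pause_after >= timeout:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def toy_eos_score(history: List[str], next_word: Optional[str]) -> float:
    """Stand-in linguistic scorer: 1.0 if the look-ahead word looks like a sentence start."""
    sentence_starters = {"then", "so", "but"}  # illustrative list only
    return 1.0 if next_word is None or next_word in sentence_starters else 0.0


def hybrid_segments(
    words: List[Word],
    eos_score: Callable[[List[str], Optional[str]], float] = toy_eos_score,
    timeout: float = 0.5,
    acoustic_weight: float = 0.5,
    threshold: float = 0.6,
) -> List[List[str]]:
    """Cut when a weighted mix of pause evidence and linguistic evidence passes the threshold."""
    segments, current = [], []
    for i, word in enumerate(words):
        current.append(word.text)
        # One word of look-ahead: peek at the next word before deciding.
        next_word = words[i + 1].text if i + 1 < len(words) else None
        acoustic = min(word.pause_after / timeout, 1.0)
        linguistic = eos_score(current, next_word)
        score = acoustic_weight * acoustic + (1 - acoustic_weight) * linguistic
        if score >= threshold:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# The speaker pauses to think after "think", then pauses briefly after "go".
utterance = [
    Word("i", 0.0), Word("think", 0.8), Word("we", 0.0), Word("should", 0.0),
    Word("go", 0.3), Word("then", 0.0), Word("we", 0.0), Word("eat", 0.0),
]
baseline = silence_timeout_segments(utterance)
hybrid = hybrid_segments(utterance)
```

In this toy input the baseline cuts after "think" because of the long thinking pause, while the hybrid rule holds off there (the look-ahead word gives no sentence-start evidence) and cuts after "go" instead, where a shorter pause coincides with linguistic evidence.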
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
