Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura

TL;DR
This paper proposes a speech segmentation method based on a binary classifier trained on a segmented bilingual speech corpus. It yields better speech translation performance than conventional pause-based VAD segmentation, and a hybrid of the classifier and VAD improves translation further.
Contribution
The study presents a segmentation approach that uses a binary classification model trained on a segmented bilingual speech corpus, together with a hybrid method that combines this model with VAD. Both improve translation performance over conventional pause-based segmentation.
Findings
The proposed classifier-based segmentation yields better translation results than conventional pause-based VAD segmentation.
A hybrid of VAD and the classifier further improves translation performance.
The method is effective for both cascade and end-to-end speech translation systems.
Abstract
Speech segmentation, which splits long speech into short segments, is essential for speech translation (ST). Popular VAD tools like WebRTC VAD have generally relied on pause-based segmentation. Unfortunately, pauses in speech do not necessarily match sentence boundaries, and sentences can be connected by a very short pause that is difficult to detect by VAD. In this study, we propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus. We also propose a hybrid method that combines VAD and the above speech segmentation method. Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods. The hybrid approach further improved the translation performance.
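The hybrid idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_segments`, the per-frame inputs, the thresholds, and the rule for combining the two signals are all assumptions made for illustration. It assumes a classifier has already produced a boundary probability for each frame and a VAD has already labeled each frame as speech or non-speech. A cut is placed wherever the classifier is confident of a boundary or the VAD reports a sufficiently long pause.

```python
from typing import List, Sequence, Tuple


def hybrid_segments(
    boundary_prob: Sequence[float],
    is_speech: Sequence[bool],
    prob_threshold: float = 0.5,   # assumed value, not taken from the paper
    min_pause_frames: int = 3,     # assumed value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    A frame is a cut point if the (hypothetical) classifier's boundary
    probability reaches prob_threshold, or if it lies inside a non-speech
    run of at least min_pause_frames frames. Segments are the maximal
    runs of frames that are not cut points.
    """
    n = len(boundary_prob)
    if n != len(is_speech):
        raise ValueError("boundary_prob and is_speech must have equal length")

    # Cuts proposed by the classifier.
    cut = [p >= prob_threshold for p in boundary_prob]

    # Cuts proposed by the VAD: only pauses that are long enough.
    i = 0
    while i < n:
        if is_speech[i]:
            i += 1
            continue
        j = i
        while j < n and not is_speech[j]:
            j += 1
        if j - i >= min_pause_frames:
            for k in range(i, j):
                cut[k] = True
        i = j

    # Collect maximal runs of non-cut frames.
    segments: List[Tuple[int, int]] = []
    start = None
    for idx in range(n):
        if cut[idx]:
            if start is not None:
                segments.append((start, idx))
                start = None
        elif start is None:
            start = idx
    if start is not None:
        segments.append((start, n))
    return segments
```

With the classifier disabled (all probabilities at 0), the sketch reduces to plain pause-based segmentation, which makes the abstract's point visible: a sentence boundary with a very short pause is only found when the classifier signal is included.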
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
