Lightweight Audio Segmentation for Long-form Speech Translation
Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung

TL;DR
This paper introduces a lightweight speech segmentation model optimized for long-form speech translation, using an ASR-with-punctuation pre-training strategy to improve translation quality while reducing computational requirements.
Contribution
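The work presents a small segmentation model and an ASR-with-punctuation pre-training method tailored to speech translation systems. It targets two problems in prior data-driven segmenters: the mismatch between the segmentation model and the ST system, and the computational cost of large self-supervised speech models.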
Findings
The proposed model achieves better speech translation quality than prior data-driven segmentation approaches while using a smaller model.
Pre-training with ASR-with-punctuation enhances segmentation effectiveness.
Proper integration into ST systems is crucial for optimal performance.
Abstract
Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches for the speech segmentation task have been developed. Although these approaches improve overall translation quality, a performance gap exists due to a mismatch between the models and ST systems. In addition, prior works require large self-supervised speech models, which consume significant computational resources. In this work, we propose a segmentation model that achieves better speech translation quality with a small model size. We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model. We also show that proper integration of the speech segmentation model into the…
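To make the partitioning step concrete, here is a minimal sketch of how long-form audio might be cut into segments before translation. It is not the authors' algorithm: it assumes a segmentation model has already produced a per-frame boundary probability, and the function name `split_by_boundaries` and the parameters `threshold`, `min_len`, and `max_len` are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_boundaries(
    probs: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off for calling a frame a boundary
    min_len: int = 2,        # assumed minimum segment length, in frames
    max_len: int = 6,        # assumed maximum segment length, in frames
) -> List[Tuple[int, int]]:
    """Greedily split frames into half-open (start, end) segments.

    A segment is closed at frame i when the boundary probability is high
    enough and the segment is at least min_len frames long, or when the
    segment has reached max_len frames.
    """
    segments: List[Tuple[int, int]] = []
    start = 0
    for i, p in enumerate(probs):
        length = i - start + 1  # frames in the current segment, including i
        is_boundary = p >= threshold and length >= min_len
        if is_boundary or length >= max_len:
            segments.append((start, i + 1))
            start = i + 1
    # Keep any trailing frames as a final (possibly short) segment.
    if start < len(probs):
        segments.append((start, len(probs)))
    return segments


# Hypothetical boundary probabilities for ten frames.
example_probs = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.2]
example_segments = split_by_boundaries(example_probs)
```

Each resulting segment would then be passed to the ST model on its own. The length limits reflect the point made in the abstract that most ST models are designed to process short speech segments; the specific values here are placeholders.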
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
