Lightweight Audio Segmentation for Long-form Speech Translation

Jaesong Lee; Soyoon Kim; Hanbyul Kim; Joon Son Chung

arXiv:2406.10549·eess.AS·June 18, 2024

Open Access

TL;DR

This paper introduces a lightweight speech segmentation model optimized for long-form speech translation, using an ASR-with-punctuation pre-training strategy to improve translation quality while reducing computational requirements.

Contribution

The work presents a novel, small-sized segmentation model with an effective pre-training method tailored for speech translation systems, addressing performance gaps and resource constraints.

Findings

01. The proposed model improves translation quality compared to existing methods.

02. Pre-training with ASR-with-punctuation enhances segmentation effectiveness.

03. Proper integration into ST systems is crucial for optimal performance.

Abstract

Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches to the speech segmentation task have been developed. Although these approaches improve overall translation quality, a performance gap remains due to a mismatch between the segmentation models and ST systems. In addition, prior work requires large self-supervised speech models, which consume significant computational resources. In this work, we propose a segmentation model that achieves better speech translation quality with a small model size. We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model. We also show that proper integration of the speech segmentation model into the…
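
To make the partitioning step concrete, here is a minimal sketch of one way long-form audio could be cut into segments that fit an ST model's input limit. It is not the paper's algorithm. It assumes a hypothetical segmentation model has already produced `boundary_prob`, a per-frame score for how likely a segment boundary falls just before that frame; the function name `split_long_audio` and the parameters `max_len` and `min_len` are illustrative only.

```python
from typing import List, Tuple


def split_long_audio(
    boundary_prob: List[float], max_len: int, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Greedily cut a frame sequence into half-open segments (start, end).

    Assumption (not from the paper): boundary_prob[i] scores how likely a
    segment boundary falls just before frame i. Each segment is at most
    max_len frames long; every segment except possibly the last is at
    least min_len frames long.
    """
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")

    n = len(boundary_prob)
    segments: List[Tuple[int, int]] = []
    start = 0
    # Keep cutting while the remaining audio is too long for one segment.
    while n - start > max_len:
        lo = start + min_len
        hi = start + max_len  # hi < n is guaranteed by the loop condition
        # Cut at the most boundary-like frame inside the allowed window.
        cut = max(range(lo, hi + 1), key=lambda i: boundary_prob[i])
        segments.append((start, cut))
        start = cut
    segments.append((start, n))  # remainder already fits in one segment
    return segments


if __name__ == "__main__":
    probs = [0.0, 0.1, 0.9, 0.0, 0.2, 0.8, 0.1, 0.0]
    print(split_long_audio(probs, max_len=4))
```

Each returned pair is a frame range that would be passed to the ST model as one segment. The greedy rule is chosen only for brevity; a real system would also need to map frame indices back to timestamps.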


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing