SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations
Ioannis Tsiamas, José A. R. Fonollosa, Marta R. Costa-jussà

TL;DR
SegAugment introduces a segmentation-based data augmentation method for speech translation that generates multiple sentence-level variants, improving translation quality across multiple languages and closing the gap between manual and automatic segmentation.
Contribution
The paper presents SegAugment, a segmentation-based augmentation strategy that generates multiple alternative sentence-level versions of a speech translation dataset by re-segmenting each document's audio under different length constraints, which improves translation quality.
Findings
Consistent BLEU improvements across eight language pairs in MuST-C, averaging 2.5 BLEU points.
Gains of up to 5 BLEU points in low-resource scenarios in mTEDx.
New state-of-the-art results on MuST-C when combined with a strong system.
Abstract
End-to-end Speech Translation is hindered by a lack of available data resources. While most of them are based on documents, a sentence-level version is available, which is however single and static, potentially impeding the usefulness of the data. We propose a new data augmentation strategy, SegAugment, to address this issue by generating multiple alternative sentence-level versions of a dataset. Our method utilizes an Audio Segmentation system, which re-segments the speech of each document with different length constraints, after which we obtain the target text via alignment methods. Experiments demonstrate consistent gains across eight language pairs in MuST-C, with an average increase of 2.5 BLEU points, and up to 5 BLEU for low-resource scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment establishes new state-of-the-art results in MuST-C. Finally, we show…
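The re-segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's Audio Segmentation system: it assumes the document's speech has already been split into small atomic units (for example at pauses) and greedily merges them under a maximum-length constraint. The names `resegment` and `augment` and the length values are hypothetical, and the step that aligns target text to the new segments is omitted.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def resegment(units: List[Segment], max_len: float) -> List[Segment]:
    """Greedily merge consecutive atomic units into segments.

    A unit is merged into the current segment while the segment's total
    span stays within max_len seconds; otherwise a new segment starts.
    A single unit longer than max_len is kept whole (assumption: atomic
    units are never split further in this sketch).
    """
    segments: List[Segment] = []
    cur_start = cur_end = None
    for start, end in units:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # extend the current segment
        else:
            segments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    return segments


def augment(units: List[Segment], max_lens: List[float]) -> Dict[float, List[Segment]]:
    """Produce one alternative segmentation of the document per constraint."""
    return {max_len: resegment(units, max_len) for max_len in max_lens}


# Hypothetical atomic units for one document, in seconds.
units = [(0.0, 2.0), (2.5, 4.0), (5.0, 9.0), (9.5, 11.0), (12.0, 20.0)]
versions = augment(units, [5.0, 12.0])
for max_len, segs in versions.items():
    print(max_len, segs)
```

Each length constraint yields a different sentence-level view of the same audio, which is the source of the augmentation; in the method described above, each new segment would then be paired with target text obtained via alignment.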
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Music and Audio Processing
