Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
Yasmin Moslem

TL;DR
This paper explores the use of synthetic audio data and augmentation techniques to improve end-to-end Irish-to-English speech translation systems based on Whisper, demonstrating how data diversity enhances translation performance.
Contribution
It applies data augmentation strategies, including speech back-translation and noise augmentation, to low-resource end-to-end speech translation.
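As a rough illustration of speech back-translation, the sketch below assumes the common reading of the term: source-language text from a parallel text corpus is run through a text-to-speech system, and the synthetic audio is paired with the existing English translation. The function `build_synthetic_pairs`, the `placeholder_tts` stand-in, and the dictionary keys are hypothetical names chosen for this example. They are not taken from the paper, which does not specify its pipeline in this summary.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Waveform = List[float]  # mono audio samples; a real system would use arrays


def build_synthetic_pairs(
    text_pairs: Iterable[Tuple[str, str]],
    tts: Callable[[str], Waveform],
) -> List[Dict[str, object]]:
    """Turn (source_text, target_text) pairs into (synthetic audio, translation) examples."""
    examples: List[Dict[str, object]] = []
    for source_text, target_text in text_pairs:
        source_text, target_text = source_text.strip(), target_text.strip()
        if not source_text or not target_text:
            continue  # skip pairs where either side is empty
        examples.append(
            {
                "audio": tts(source_text),   # synthetic source-language speech
                "translation": target_text,  # existing human translation
                "synthetic": True,           # flag so real and synthetic data can be mixed deliberately
            }
        )
    return examples


def placeholder_tts(text: str) -> Waveform:
    """Hypothetical stand-in for a real TTS engine: returns silence scaled to text length."""
    return [0.0] * (len(text) * 100)


pairs = [("Dia duit", "Hello"), ("", "Orphan line")]
synthetic = build_synthetic_pairs(pairs, placeholder_tts)
```

Keeping a `synthetic` flag on each example is a design choice for this sketch: it makes it easy to control the ratio of real to synthetic audio during fine-tuning.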
Findings
Synthetic data improves translation accuracy
Data augmentation enhances signal diversity
End-to-end models outperform traditional pipelines
Abstract
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
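To make noise augmentation concrete, here is a minimal sketch that mixes white Gaussian noise into a waveform at a chosen signal-to-noise ratio (SNR). The abstract does not say which noise types or SNR levels the authors used, so the function name `add_noise_at_snr`, the white-noise choice, the 10 dB level, and the 16 kHz test tone are all assumptions made for illustration.

```python
import math
import random
from typing import List, Optional


def add_noise_at_snr(
    signal: List[float], snr_db: float, rng: Optional[random.Random] = None
) -> List[float]:
    """Return a copy of `signal` with white Gaussian noise added at `snr_db` decibels."""
    rng = rng or random.Random()
    if not signal:
        return []
    signal_power = sum(s * s for s in signal) / len(signal)
    if signal_power == 0.0:
        return list(signal)  # silence: no meaningful SNR, leave unchanged
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Example: a one-second 440 Hz tone sampled at 16 kHz (assumed sample rate).
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
noisy = add_noise_at_snr(clean, snr_db=10, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the augmentation reproducible. Varying `snr_db` per utterance is one simple way to diversify the signal conditions seen during training.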
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
