On Using SpecAugment for End-to-End Speech Translation
Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

TL;DR
This paper explores SpecAugment, a simple data augmentation method applied directly to audio features, which improves end-to-end speech translation performance across different datasets and data scenarios.
Contribution
It demonstrates that SpecAugment effectively enhances speech translation accuracy and robustness, with consistent gains across multiple datasets and data conditions.
Findings
Up to +2.2% BLEU on LibriSpeech En->Fr
Up to +1.2% BLEU on IWSLT En->De
Effective in various data scenarios regardless of data size
Abstract
This work investigates a simple data augmentation technique, SpecAugment, for end-to-end speech translation. SpecAugment is cheap to implement, is applied directly to the audio input features, and consists of masking blocks of frequency channels and/or time steps. We apply SpecAugment to end-to-end speech translation tasks and achieve up to +2.2% BLEU on LibriSpeech Audiobooks En->Fr and +1.2% on IWSLT TED-talks En->De by alleviating overfitting to some extent. We also examine the effectiveness of the method in a variety of data scenarios and show that it leads to significant improvements irrespective of the amount of training data.
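The masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the number of masks, the maximum mask widths, and the choice of zero as the fill value are all assumptions made for illustration, and the feature matrix is assumed to be a list of frames, each a list of per-channel values.

```python
import random


def mask_blocks(length, num_masks, max_width, rng):
    """Pick indices covered by `num_masks` randomly placed contiguous blocks.

    Hypothetical helper: each block has a width drawn uniformly from
    [0, max_width] (capped at `length`) and a uniformly drawn start.
    """
    masked = set()
    for _ in range(num_masks):
        width = rng.randint(0, min(max_width, length))  # inclusive bounds
        start = rng.randint(0, length - width)
        masked.update(range(start, start + width))
    return masked


def spec_augment(features, num_time_masks=2, max_time_width=20,
                 num_freq_masks=2, max_freq_width=5, seed=None):
    """Return a copy of `features` with time and frequency blocks zeroed.

    `features` is a list of frames (time steps); each frame is a list of
    values, one per frequency channel. All default values are assumptions,
    not settings taken from the paper.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    num_channels = len(features[0]) if num_frames else 0
    time_idx = mask_blocks(num_frames, num_time_masks, max_time_width, rng)
    freq_idx = mask_blocks(num_channels, num_freq_masks, max_freq_width, rng)
    # A cell is masked if its frame or its channel falls inside any block.
    return [
        [0.0 if (t in time_idx or f in freq_idx) else value
         for f, value in enumerate(frame)]
        for t, frame in enumerate(features)
    ]


if __name__ == "__main__":
    # Toy input: 100 frames with 40 channels each, all ones.
    toy = [[1.0] * 40 for _ in range(100)]
    augmented = spec_augment(toy, seed=0)
    masked_cells = sum(value == 0.0 for frame in augmented for value in frame)
    print(f"masked {masked_cells} of {100 * 40} cells")
```

In training, a function like this would typically be applied to each utterance on the fly, with a fresh random draw per epoch, so the model rarely sees the same masked input twice.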
Taxonomy
Topics: Natural Language Processing Techniques · Music and Audio Processing · Speech Recognition and Synthesis
