Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

TL;DR
This paper explores data augmentation and pretraining techniques to improve end-to-end automatic speech translation, significantly narrowing the performance gap with cascade models by leveraging indirect training data.
Contribution
It systematically evaluates augmentation and pretraining methods for AST, providing practical recommendations and demonstrating substantial performance improvements.
Findings
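Data augmentation by translating ASR transcripts is the most effective technique on the English-French augmented LibriSpeech dataset.
End-to-end models can approach cascade performance when trained on such indirect data: the gap narrows from 8.2 to 1.4 BLEU on English-French and, with fine-tuning, from 6.7 to 3.7 BLEU on English-Romanian MuST-C.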
Transformer architecture further reduces the performance gap.
Abstract
For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English–French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the English–Romanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we…
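The augmentation described in the abstract, translating ASR transcripts to obtain synthetic speech-translation pairs, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `SpeechExample`, `augment_with_mt`, the file paths, and the word-lookup `toy_translate` are all hypothetical, and a real setup would call a trained MT model in place of the lookup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class SpeechExample:
    """One AST training pair (hypothetical container, not from the paper)."""
    audio_path: str    # source-language audio
    target_text: str   # target-language text the AST model should emit
    synthetic: bool    # True when target_text was produced by an MT system


def augment_with_mt(
    asr_corpus: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[SpeechExample]:
    """Turn (audio, transcript) ASR pairs into (audio, translation) AST pairs."""
    return [
        SpeechExample(audio_path=audio, target_text=translate(transcript), synthetic=True)
        for audio, transcript in asr_corpus
    ]


# Placeholder "MT system": a word-by-word lookup, used only to keep the sketch runnable.
TOY_LEXICON = {"hello": "bonjour", "world": "monde", "good": "bon", "morning": "matin"}


def toy_translate(sentence: str) -> str:
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())


# ASR data: audio paired with source-language transcripts only.
asr_corpus = [
    ("audio/0001.wav", "hello world"),
    ("audio/0002.wav", "good morning"),
]

# Small "real" AST corpus with human translations.
real_ast = [SpeechExample("audio/0100.wav", "merci", synthetic=False)]

augmented = augment_with_mt(asr_corpus, toy_translate)
training_set = real_ast + augmented  # train the end-to-end model on the union
```

Keeping a `synthetic` flag on each example is a design choice of this sketch; it makes it straightforward to fine-tune on the real pairs only after training on the union.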
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
