TL;DR
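Pre-training a model on a high-resource speech recognition (ASR) task and then fine-tuning it for speech-to-text translation (ST) substantially improves low-resource ST. Most of the gain comes from the pre-trained encoder (acoustic model), and the benefit holds even when the ASR language differs from the ST languages.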
Contribution
This paper introduces a simple approach to low-resource speech translation: pre-train the model on high-resource ASR data, then fine-tune its parameters for ST. It demonstrates the approach's effectiveness across multiple language pairs.
Findings
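Pre-training on 300 hours of English ASR data nearly doubles Spanish-English ST BLEU, from 10.8 to 20.2, when only 20 hours of ST training data are available.
Pre-training on French ASR also benefits Spanish-English ST, even though French is neither the source nor the target language.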
The approach improves Mboshi-French ST from 3.5 to 7.1 BLEU with minimal data.
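The size of these gains is easier to compare as a ratio. The short sketch below computes the multiplicative and absolute improvement from the BLEU scores quoted above; the function name `relative_gain` and the dictionary labels are illustrative and do not come from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return improved / baseline as a multiplicative factor."""
    if baseline <= 0:
        raise ValueError("baseline BLEU must be positive")
    return improved / baseline


# BLEU scores quoted on this page: (baseline, with ASR pre-training).
reported = {
    "Spanish-English": (10.8, 20.2),
    "Mboshi-French": (3.5, 7.1),
}

for pair, (baseline, improved) in reported.items():
    factor = relative_gain(baseline, improved)
    absolute = improved - baseline
    print(f"{pair}: {baseline} -> {improved} BLEU "
          f"({factor:.2f}x, +{absolute:.1f} absolute)")
```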
Abstract
We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English…
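The pre-train-then-fine-tune recipe amounts to initializing the ST model from the ASR model's parameters before training on ST data. The following is a minimal sketch of that initialization step, not the authors' implementation: parameters are modeled as plain name-to-values dictionaries, and the function `init_from_asr`, the `encoder.`/`decoder.` naming scheme, and the size check are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Params = Dict[str, List[float]]


def init_from_asr(
    asr_params: Params,
    st_params: Params,
    prefixes: Sequence[str] = ("encoder.",),
) -> Tuple[Params, List[str]]:
    """Initialize ST parameters from a pre-trained ASR model.

    A parameter is copied only if its name starts with one of `prefixes`,
    it also exists in the ST model, and both have the same number of values.
    Everything else keeps its fresh ST initialization.
    """
    merged = {name: list(values) for name, values in st_params.items()}
    transferred = []
    for name, values in asr_params.items():
        if not name.startswith(tuple(prefixes)):
            continue
        if name in merged and len(merged[name]) == len(values):
            merged[name] = list(values)
            transferred.append(name)
    return merged, transferred


# Hypothetical toy models: the decoder sizes differ, as they would when the
# ASR and ST output vocabularies differ.
asr = {"encoder.conv.weight": [0.1, 0.2], "decoder.embed.weight": [0.3, 0.4, 0.5]}
st = {"encoder.conv.weight": [0.0, 0.0], "decoder.embed.weight": [0.0, 0.0]}

# Encoder-only transfer, mirroring the ablation finding that the encoder
# accounts for most of the improvement.
st_init, copied = init_from_asr(asr, st)
print(copied)
```

Restricting the transfer to the encoder is what makes the cross-lingual case possible in this sketch: when the ASR language matches neither ST language, the decoder has nothing shared to reuse, but the acoustic encoder can still be copied and fine-tuned.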
