Multilingual Speech Translation with Efficient Finetuning of Pretrained Models
Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

TL;DR
This paper introduces a minimalistic finetuning method for multilingual speech translation that leverages pretrained models, achieving state-of-the-art results with low training cost and strong zero-shot capabilities.
Contribution
It demonstrates that finetuning only the LayerNorm and attention (LNA) parameters, less than 10% of the pretrained parameters, is enough for effective multilingual speech translation and zero-shot transfer.
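The parameter selection behind LNA finetuning can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names, the sizes, and the name-matching keywords below are all hypothetical, and which attention sublayers are finetuned in the encoder versus the decoder is an assumption that this summary does not specify. The sketch only shows the idea of marking LayerNorm and attention parameters as trainable, freezing everything else, and measuring the trainable fraction.

```python
# Toy parameter inventory: parameter name -> number of scalar values.
# All names and sizes are hypothetical; they are NOT taken from
# wav2vec 2.0 or mBART checkpoints.
TOY_PARAMS = {
    "encoder.layers.0.self_attn.q_proj.weight": 64 * 64,
    "encoder.layers.0.self_attn_layer_norm.weight": 64,
    "encoder.layers.0.fc1.weight": 64 * 256,
    "encoder.layers.0.fc2.weight": 256 * 64,
    "decoder.embed_tokens.weight": 1000 * 64,
    "decoder.layers.0.encoder_attn.k_proj.weight": 64 * 64,
    "decoder.layers.0.final_layer_norm.weight": 64,
    "decoder.layers.0.fc1.weight": 64 * 256,
}

# Assumed substrings that identify LayerNorm and attention parameters.
LNA_KEYWORDS = ("layer_norm", "self_attn", "encoder_attn")


def is_lna_param(name):
    """Return True if the parameter name looks like LayerNorm or attention."""
    return any(keyword in name for keyword in LNA_KEYWORDS)


def split_lna(params):
    """Split a name->size mapping into (trainable, frozen) mappings."""
    trainable = {n: size for n, size in params.items() if is_lna_param(n)}
    frozen = {n: size for n, size in params.items() if not is_lna_param(n)}
    return trainable, frozen


def trainable_fraction(params):
    """Fraction of scalar parameters that would be finetuned."""
    trainable, _ = split_lna(params)
    return sum(trainable.values()) / sum(params.values())


if __name__ == "__main__":
    trainable, frozen = split_lna(TOY_PARAMS)
    print("trainable:", sorted(trainable))
    print("frozen:", sorted(frozen))
    print("trainable fraction:", trainable_fraction(TOY_PARAMS))
```

In a framework such as PyTorch, the same predicate could be applied while iterating over `model.named_parameters()` to set each parameter's `requires_grad` flag. The fraction computed on this toy inventory says nothing about the real models; the "less than 10%" figure comes from the paper.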
Findings
Achieved +6.4 BLEU on average across 15 En-X directions
Surpassed cascaded ST in 23 out of 34 directions
Demonstrated strong zero-shot performance with +5.7 BLEU on average
Abstract
We present a simple yet effective approach to build multilingual speech-to-text (ST) translation by efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by finetuning less than 10% of the pretrained parameters. This enables effectively leveraging large pretrained models with low training cost. Using wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, our approach sets a new state of the art for 34 translation directions (surpassing cascaded ST for 23 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average across 15 En-X directions and +5.1 BLEU on average across 19 X-En directions). Our approach demonstrates strong zero-shot performance in a…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: mBART
