Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Xian Li; Changhan Wang; Yun Tang; Chau Tran; Yuqing Tang; Juan Pino; Alexei Baevski; Alexis Conneau; Michael Auli

arXiv:2010.12829 · cs.CL · January 5, 2021 · 27 cites

Open Access

TL;DR

This paper introduces a minimalistic finetuning method for multilingual speech translation that leverages pretrained models, achieving state-of-the-art results with low training cost and strong zero-shot capabilities.

Contribution

It demonstrates that finetuning only the LayerNorm and Attention (LNA) parameters, less than 10% of the pretrained parameters, enables effective multilingual speech translation and zero-shot transfer.

Findings

01 Advanced the state of the art on CoVoST 2 by +6.4 BLEU on average across 15 En-X directions

02 Surpassed cascaded ST in 23 out of 34 directions

03 Demonstrated strong zero-shot performance with +5.7 BLEU on average

Abstract

We present a simple yet effective approach to building multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by finetuning less than 10% of the pretrained parameters. This makes it possible to leverage large pretrained models effectively at low training cost. Using wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, our approach sets a new state of the art for 34 translation directions (surpassing cascaded ST for 23 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average across 15 En-X directions and +5.1 BLEU on average across 19 X-En directions). Our approach demonstrates strong zero-shot performance in a…
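
The parameter-selection idea behind LNA finetuning can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names, the toy sizes, and the helper functions `select_lna` and `trainable_fraction` are hypothetical, and matching on the substrings `layer_norm` and `attn` is only an assumed way of picking out LayerNorm and attention parameters. Which attention modules are actually finetuned is not specified in the text above.

```python
from typing import Dict, Iterable, List

# Hypothetical parameter inventory: name -> number of scalar parameters.
# Names and sizes are made up for illustration; they are NOT taken from
# wav2vec 2.0 or mBART checkpoints.
TOY_PARAMS: Dict[str, int] = {
    "encoder.layers.0.self_attn.q_proj.weight": 4,
    "encoder.layers.0.self_attn_layer_norm.weight": 2,
    "encoder.layers.0.fc1.weight": 16,
    "decoder.layers.0.encoder_attn.k_proj.weight": 4,
    "decoder.layers.0.final_layer_norm.weight": 2,
    "decoder.layers.0.fc2.weight": 16,
    "decoder.embed_tokens.weight": 100,
}


def select_lna(
    params: Dict[str, int],
    keywords: Iterable[str] = ("layer_norm", "attn"),
) -> List[str]:
    """Return names of parameters kept trainable under an LNA-style rule.

    Assumption: LayerNorm and attention parameters can be recognised by
    substrings in their names. Everything else would stay frozen.
    """
    keys = tuple(keywords)
    return [name for name in params if any(k in name for k in keys)]


def trainable_fraction(params: Dict[str, int], trainable: Iterable[str]) -> float:
    """Share of all scalar parameters that belong to the trainable subset."""
    total = sum(params.values())
    return sum(params[name] for name in trainable) / total


lna_names = select_lna(TOY_PARAMS)
print("trainable:", lna_names)
print(f"fraction: {trainable_fraction(TOY_PARAMS, lna_names):.1%}")
```

In a framework such as PyTorch, a selection like this would typically be applied by setting `requires_grad` to `False` on every parameter outside the chosen subset. The fraction printed for the toy inventory says nothing about the real models; it only shows how such a ratio would be computed.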

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: mBART