FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task
Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal

TL;DR
This paper presents FST, an end-to-end multilingual speech translation system that uses transfer learning and joint training to outperform previously reported systems and, in some translation directions, to reach quality comparable to text-based translation.
Contribution
The paper describes an end-to-end multilingual speech translation system that combines pretrained multilingual modules, joint training on the text and speech translation tasks, and fine-tuning on speech translation data.
Findings
Outperforms previously reported end-to-end and cascaded systems by a large margin.
Achieves translation quality comparable to text-based systems in some directions.
Utilizes large-scale pretraining and joint task training for knowledge transfer.
Abstract
In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded based approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even…
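The staged recipe in the abstract (joint training on text and speech translation, then fine-tuning on speech translation alone) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the stage names, the task names, the `stage_loss` helper, and the weighted-sum form of the objective are hypothetical and are not taken from the paper, which may combine its losses differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage and the tasks whose losses are active in it."""
    name: str
    tasks: Tuple[str, ...]


# Hypothetical two-stage schedule that follows the order given in the abstract.
# Initialization from pretrained modules is assumed to happen before stage 0.
SCHEDULE = (
    Stage("joint_training", ("speech_translation", "text_translation")),
    Stage("finetuning", ("speech_translation",)),
)


def stage_loss(
    losses: Dict[str, float],
    stage: Stage,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the per-task losses that are active in `stage`.

    The weighted-sum form and the default weight of 1.0 are assumptions,
    not the paper's stated objective.
    """
    weights = weights or {}
    return sum(weights.get(task, 1.0) * losses[task] for task in stage.tasks)
```

With per-task losses of 2.0 for speech translation and 1.0 for text translation, and a hypothetical text-task weight of 0.5, the joint stage would optimize 2.0 + 0.5 × 1.0, while the fine-tuning stage would use the speech translation loss alone.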
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
