Multilingual Speech Translation with Unified Transformer: Huawei Noah's Ark Lab at IWSLT 2021
Xingshan Zeng, Liangyou Li, Qun Liu

TL;DR
This paper presents a unified transformer model for multilingual speech translation that uses multi-task learning and data augmentation to improve performance across multiple languages and three tasks: speech recognition, machine translation, and speech translation.
Contribution
The paper introduces a unified transformer architecture that processes speech and text inputs jointly for multilingual tasks, enhancing performance through multi-task training and data augmentation techniques.
Findings
Outperforms bilingual baselines on supervised language pairs
Achieves reasonable results on zero-shot language pairs
Effective use of multi-task learning and data augmentation
Abstract
This paper describes the system submitted to the IWSLT 2021 Multilingual Speech Translation (MultiST) task from Huawei Noah's Ark Lab. We use a unified transformer architecture for our MultiST model, so that the data from different modalities (i.e., speech and text) and different tasks (i.e., Speech Recognition, Machine Translation, and Speech Translation) can be exploited to enhance the model's ability. Specifically, speech and text inputs are first fed to different feature extractors to extract acoustic and textual features, respectively. Then, these features are processed by a shared encoder-decoder architecture. We apply several training techniques to improve the performance, including multi-task learning, task-level curriculum learning, data augmentation, etc. Our final system achieves significantly better results than bilingual baselines on supervised language pairs and yields…
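The routing described in the abstract (modality-specific feature extractors feeding one shared encoder) can be illustrated with a small sketch. This is not the authors' implementation: the class names, the dimensions, the frame-stacking downsampler, and the single linear-plus-ReLU "encoder" are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. The decoder is omitted.

```python
import random

D_MODEL = 8  # hypothetical shared feature width (assumption, not from the paper)


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(x, w):
    """Multiply x ([T][in]) by w ([in][out]) and return a [T][out] matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


class SpeechFeatureExtractor:
    """Stand-in acoustic front end: stacks frame pairs (2x shorter), then projects."""

    def __init__(self, n_mels, rng):
        self.w = make_matrix(2 * n_mels, D_MODEL, rng)

    def __call__(self, frames):
        usable = len(frames) - len(frames) % 2  # drop a trailing odd frame
        stacked = [frames[i] + frames[i + 1] for i in range(0, usable, 2)]
        return matmul(stacked, self.w)


class TextFeatureExtractor:
    """Stand-in textual front end: a plain embedding lookup."""

    def __init__(self, vocab_size, rng):
        self.emb = make_matrix(vocab_size, D_MODEL, rng)

    def __call__(self, token_ids):
        return [self.emb[t] for t in token_ids]


class SharedEncoder:
    """Stand-in for the shared transformer encoder: one linear layer plus ReLU."""

    def __init__(self, rng):
        self.w = make_matrix(D_MODEL, D_MODEL, rng)

    def __call__(self, feats):
        return [[max(0.0, v) for v in row] for row in matmul(feats, self.w)]


class UnifiedModel:
    """Routes each input through its own extractor, then through shared weights."""

    def __init__(self, n_mels=4, vocab_size=10, seed=0):
        rng = random.Random(seed)
        self.speech = SpeechFeatureExtractor(n_mels, rng)
        self.text = TextFeatureExtractor(vocab_size, rng)
        self.encoder = SharedEncoder(rng)

    def encode(self, inputs, modality):
        if modality == "speech":
            feats = self.speech(inputs)
        elif modality == "text":
            feats = self.text(inputs)
        else:
            raise ValueError(f"unknown modality: {modality}")
        return self.encoder(feats)


model = UnifiedModel()
speech_frames = [[0.1, 0.2, 0.3, 0.4] for _ in range(6)]  # 6 frames, 4 filterbank bins
text_tokens = [1, 5, 7]                                    # 3 token ids

speech_states = model.encode(speech_frames, "speech")
text_states = model.encode(text_tokens, "text")
print(len(speech_states), len(speech_states[0]))
print(len(text_states), len(text_states[0]))
```

Both calls return sequences of the same feature width, which is what lets a single set of encoder (and, in the full system, decoder) parameters serve the speech recognition, machine translation, and speech translation tasks.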
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
