Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning
Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin

TL;DR
This paper introduces a three-stage curriculum learning approach to improve many-to-many speech-to-text translation with multimodal large language models, especially in low-resource language settings, achieving state-of-the-art results.
Contribution
The paper presents a novel curriculum learning strategy that adapts large language models for effective many-to-many speech translation in low-resource scenarios.
Findings
Achieves state-of-the-art average performance across 15×14 language pairs.
Requires fewer than 10 hours of speech data per language to reach competitive results.
Effective across models with 3B, 7B, and 32B parameters.
Abstract
Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in 15×14 language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The…
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Speech and dialogue systems
