MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

Yexing Du; Kaiyuan Liu; Youcheng Pan; Bo Yang; Keqi Deng; Xie Chen; Yang Xiang; Ming Liu; Bing Qin; YaoWei Wang

arXiv:2512.01512 · cs.CL · April 14, 2026

TL;DR

MCAT is an MLLM-based framework that extends many-to-many speech-to-text translation to 70 languages and speeds up inference by reducing the speech sequence length to 30 tokens. It surpasses state-of-the-art models on the FLEURS dataset.

Contribution

The paper presents a language scaling method, built on curriculum learning and a data balancing strategy, that extends MLLM language coverage to 70 languages, together with an optimized speech adapter that shortens the speech sequence to speed up inference.

Findings

01

Supports mutual (many-to-many) translation among 70 languages.

02

Surpasses state-of-the-art models on the FLEURS dataset across 70×69 (4,830) translation directions.

03

Improves inference efficiency by reducing the speech sequence length to 30 tokens.
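
The two numbers in the findings can be checked with a few lines of arithmetic. The sketch below enumerates ordered (source, target) pairs for 70 languages and computes the length reduction relative to the 750-token example quoted in the abstract. The language codes are placeholders, since the actual 70-language list is not given on this page, and `translation_directions` is a hypothetical helper name.

```python
from itertools import permutations


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Placeholder codes (assumption): the real language list is not shown here.
languages = [f"lang{i:02d}" for i in range(70)]

directions = translation_directions(languages)
num_directions = len(directions)  # 70 sources x 69 targets each

# Reduction relative to the 750-token example mentioned in the abstract.
compression_ratio = 750 / 30

print(num_directions, compression_ratio)
```

Each direction is counted separately from its reverse, which is why the total is 70×69 rather than the number of unordered language pairs.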

Abstract

Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an…
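
The abstract names a data balancing strategy but the portion shown here does not describe it. As an illustration only, the sketch below uses exponent-smoothed sampling, a common way to up-weight low-resource languages in multilingual training; it is an assumption, not necessarily the authors' method. The function name `balanced_sampling_probs`, the exponent `alpha`, and the per-language hours are all hypothetical.

```python
def balanced_sampling_probs(hours_per_language, alpha=0.5):
    """Sampling probabilities proportional to (data share) ** alpha.

    alpha = 1.0 reproduces the raw data distribution; smaller values
    flatten it, so low-resource languages are sampled more often.
    """
    total = sum(hours_per_language.values())
    weights = {
        lang: (hours / total) ** alpha
        for lang, hours in hours_per_language.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Made-up training hours (assumption), chosen to be heavily English-centric.
hours = {"en": 900.0, "de": 90.0, "sw": 10.0}
probs = balanced_sampling_probs(hours)

for lang, p in probs.items():
    print(f"{lang}: raw share {hours[lang] / 1000:.3f}, sampled share {p:.3f}")
```

With `alpha` below 1, the high-resource language is still sampled most often, but its share shrinks and the smallest language's share grows relative to the raw data.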

Code & Models

Repositories

yxduir/m2m-70 (GitHub)
