Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang, Kaiyuan Liu, Bo Yang, Yang Xiang, Ming Liu, Bing Qin

TL;DR
This paper introduces a scalable speech-guided multimodal machine translation framework that leverages speech-text fusion and self-evolution to achieve state-of-the-art results across multiple datasets and languages.
Contribution
It proposes a novel speech-guided translation method with a self-evolution mechanism, enabling scalable multilingual translation with synthetic speech integration.
Findings
Surpasses existing methods on the Multi30K benchmark.
Achieves state-of-the-art performance on FLORES-200 in 108 translation directions.
Synthetic speech quality has a negligible impact on translation performance.
Abstract
Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech…
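The positive/negative split of synthetic speech samples mentioned in the abstract and in the reviews can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Sample` type, the `translate` callable (standing in for the MLLM, called with and without speech), the toy `token_f1` metric, and the rule "positive if the speech-fused translation scores higher than the text-only one" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    source: str      # source-language text
    reference: str   # reference translation
    speech: bytes    # synthetic speech for `source` (placeholder payload)


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy quality metric (assumption): F1 over unique whitespace tokens."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model interface: (source text, optional speech) -> translation.
Translator = Callable[[str, Optional[bytes]], str]


def split_by_speech_benefit(
    samples: List[Sample],
    translate: Translator,
    metric: Callable[[str, str], float] = token_f1,
) -> Tuple[List[Sample], List[Sample]]:
    """Label a sample positive when adding speech improves the metric.

    The comparison rule is an assumption; the source only states that
    synthetic speech samples are classified.
    """
    positives, negatives = [], []
    for sample in samples:
        text_only = metric(translate(sample.source, None), sample.reference)
        fused = metric(translate(sample.source, sample.speech), sample.reference)
        (positives if fused > text_only else negatives).append(sample)
    return positives, negatives
```

Under these assumptions, the positive samples would be the ones fed back into training in the next evolution cycle, while the negatives would be set aside.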
Peer Reviews
Decision · ICLR 2026 Poster
It's interesting to see that adding synthetic speech to the input of a text translation system improves performance. This seems to be more effective than adding images to the input (cf. Table 3). The incremental training strategy is interesting, and I wonder whether similar techniques could be used in LLM settings other than MT.
The experimental results show that the method works, but the authors fail to justify, or even try to explain, why it works! The only argument is that the prosody in speech helps translation. I could agree with this if human speech were used. However, this work addresses the issue of how to improve text translation by additionally providing *synthetic speech*. I am not well aware of the current SOTA in TTS, but I would be surprised if the prosody were very rich. It is not clear which data was used to
- Proposes a Speech-guided Multimodal Machine Translation (SMMT) framework
- The system achieves very competitive results on multiple datasets
- It is understandable that information from multiple sources might help the machine learn better representations. But those features are real data rather than synthetic data. In this work, synthetic speech is used to augment MMT. Deep analysis is required to reveal the main factors that contribute to the good performance. For example, we don't know whether the synthetic speech samples are negative or positive at inference time. What is the impact if the provided speech sample is a negative one? How to selec
- Innovative self-evolution mechanism: The framework introduces a novel self-improvement process that leverages translation performance metrics as optimization objectives, facilitating continuous enhancement through iterative evolution cycles.
- Comprehensive multimodal and multilingual support: The approach effectively unifies speech and text modalities and extends to multiple languages, showing potential for broader generalization.
- Limited evaluation scope: Experiments are primarily conducted on a small set of benchmarks (e.g., Multi30K, FLORES-200), leaving open questions about generalization to large-scale and domain-diverse datasets.
- Dependence on external components: The framework relies heavily on open-source models, such as the TTS model, which may constrain performance consistency and scalability.
- Lack of detailed ablations: It remains unclear how much each component (speech input, multimodal fusion, self-evol
Taxonomy
Topics: Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
