MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning

Zhaopeng Feng; Yupu Liang; Shaosheng Cao; Jiayuan Su; Jiahan Ren; Zhe Xu; Yao Hu; Wenxuan Huang; Jian Wu; Zuozhu Liu

arXiv:2505.19714·cs.CL·May 27, 2025

TL;DR

This paper introduces MT$^{3}$, a multi-task reinforcement learning framework for end-to-end text image machine translation using multimodal large language models, achieving state-of-the-art results and strong generalization.

Contribution

It pioneers the application of multi-task RL to MLLMs for TIMT, integrating text recognition, reasoning, and translation with a novel reward mechanism and introducing a new social media TIMT benchmark.

Findings

01 State-of-the-art performance on the MIT-10M benchmark

02 Strong generalization to out-of-distribution datasets

03 Effective multi-task synergy and reinforcement learning strategies

Abstract

Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and…

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling