MT$^{3}$: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning
Zhaopeng Feng, Yupu Liang, Shaosheng Cao, Jiayuan Su, Jiahan Ren, Zhe Xu, Yao Hu, Wenxuan Huang, Jian Wu, Zuozhu Liu

TL;DR
This paper introduces MT$^{3}$, a multi-task reinforcement learning framework for end-to-end text image machine translation using multimodal large language models, achieving state-of-the-art results and strong generalization.
Contribution
It pioneers the application of multi-task RL to MLLMs for TIMT, integrating text recognition, reasoning, and translation with a novel reward mechanism and introducing a new social media TIMT benchmark.
Findings
State-of-the-art performance on the MIT-10M benchmark
Strong generalization to out-of-distribution datasets
Effective multi-task synergy and reinforcement learning strategies
Abstract
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and…
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
