MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Gengluo Li; Chengquan Zhang; Yupu Liang; Huawen Shen; Yaping Zhang; Pengyuan Lyu; Weinong Wang; Xingyu Wan; Gangyan Zeng; Han Hu; Can Ma; Yu Zhou

arXiv:2603.23896 · cs.CV · March 26, 2026

TL;DR

This paper introduces MMTIT-Bench, a comprehensive multilingual and multi-scenario benchmark for text-image machine translation, and proposes CPR-Trans, a reasoning-oriented data paradigm that enhances translation accuracy and interpretability.

Contribution

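The paper presents MMTIT-Bench, a human-verified benchmark of 1,400 images covering fourteen non-English, non-Chinese languages and settings such as documents, scenes, and web images. It also introduces CPR-Trans, a cognition-perception-reasoning data paradigm for improving text-image machine translation (TIMT).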

Findings

01. CPR-Trans improves translation accuracy on 3B and 7B models.

02. Structured supervision enhances interpretability of VLLMs.

03. Benchmark enables rigorous evaluation across diverse languages and scenarios.

Abstract

End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques