Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

Junxin Lu; Tengfei Song; Zhanglin Wu; Pengfei Li; Xiaowei Liang; Hui Yang; Kun Chen; Ning Xie; Yunfei Lu; Jing Zhao; Shiliang Sun; Daimeng Wei

arXiv:2602.21956·cs.CV·February 26, 2026

Open Access

TL;DR

This paper introduces GLoTran, a dual perception framework combining global and local visual cues to enhance high-resolution, text-rich image translation in multimodal large language models, addressing existing challenges of clutter and detail loss.

Contribution

GLoTran is a dual perception approach that integrates global and local visual information. Together with GLoD, a large-scale dataset, it is intended to improve Text Image Machine Translation (TIMT) performance in complex high-resolution scenarios.

Findings

01 Significantly improves translation completeness and accuracy

02 Outperforms state-of-the-art MLLMs in high-resolution TIMT

03 Provides a new paradigm for fine-grained text-rich image translation

Abstract

Text Image Machine Translation (TIMT) aims to translate text embedded in images from the source language into the target language, which requires synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images because of cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details.…
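To make the global-local idea in the abstract concrete, here is a minimal sketch of how one low-resolution global view and a set of multi-scale local slices could be laid out for a single image. It is not the authors' implementation: the abstract does not say how regions are chosen, so this sketch assumes a plain uniform grid in place of region-level text slices, and the names `global_size`, `grid_slices`, `build_dual_view`, `max_side`, and `tiles`, along with their default values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class DualView:
    global_size: Tuple[int, int]  # size of the downscaled whole image
    slices: List[Box]             # crop boxes in original-image coordinates


def global_size(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Shrink so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def grid_slices(width: int, height: int, tile: int) -> List[Box]:
    """Cover the image with non-overlapping tiles; edge tiles are clipped.

    Assumption: a uniform grid stands in for text-region slices.
    """
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append(
                (left, top, min(left + tile, width), min(top + tile, height))
            )
    return boxes


def build_dual_view(
    width: int,
    height: int,
    max_side: int = 448,                  # hypothetical global-view budget
    tiles: Tuple[int, ...] = (1024, 512), # hypothetical slice scales
) -> DualView:
    """Pair one global thumbnail size with slices at several scales."""
    slices: List[Box] = []
    for tile in tiles:
        slices.extend(grid_slices(width, height, tile))
    return DualView(global_size(width, height, max_side), slices)


if __name__ == "__main__":
    view = build_dual_view(2048, 1024)
    print(view.global_size, len(view.slices))
```

With these defaults, a 2048×1024 image yields a 448×224 global view plus 10 slices (2 at the coarse scale, 8 at the fine scale). In a real pipeline the crop boxes would be applied to the image and passed to the model together with the global view and an instruction; that step is omitted here because the abstract gives no detail about it.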

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques · Multimodal Machine Learning Applications