Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation
Junxin Lu, Tengfei Song, Zhanglin Wu, Pengfei Li, Xiaowei Liang, Hui Yang, Kun Chen, Ning Xie, Yunfei Lu, Jing Zhao, Shiliang Sun, Daimeng Wei

TL;DR
This paper introduces GLoTran, a dual perception framework that combines global and local visual cues to improve high-resolution, text-rich image translation in multimodal large language models, targeting the text omission and detail loss that cluttered layouts cause.
Contribution
The paper contributes GLoTran, a dual perception approach that integrates global and local visual information, and GLoD, a large-scale dataset, to improve text image machine translation (TIMT) performance in complex high-resolution scenarios.
Findings
Significantly improves translation completeness and accuracy
Outperforms state-of-the-art MLLMs in high-resolution TIMT
Provides a new paradigm for fine-grained text-rich image translation
Abstract
Text Image Machine Translation (TIMT) aims to translate text embedded in images from a source language into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details.…
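The global-local input described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a uniform grid as a stand-in for region-level text slicing, and the names build_dual_view, global_size, grid_slices, the 448-pixel global side, and the tile sizes are all hypothetical choices made for illustration. The sketch only computes the geometry (the downscaled size of the whole image and the crop boxes at two scales); it does not load images or call a model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class DualView:
    global_size: Tuple[int, int]  # size of the downscaled whole image
    slices: List[Box]             # crop boxes in original-image coordinates


def global_size(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Downscale so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def grid_slices(width: int, height: int, tile: int) -> List[Box]:
    """Non-overlapping square tiles; tiles on the right/bottom edge are clipped."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def build_dual_view(width: int, height: int,
                    max_side: int = 448,          # assumed global resolution
                    tiles: Tuple[int, ...] = (1024, 512)) -> DualView:
    """Pair one low-resolution global view with slices at several tile sizes.

    A uniform grid is an assumption; the paper's slices are region-level
    text crops, whose selection procedure is not given in this summary.
    """
    slices: List[Box] = []
    for tile in tiles:  # coarse scale first, then fine
        slices.extend(grid_slices(width, height, tile))
    return DualView(global_size(width, height, max_side), slices)


if __name__ == "__main__":
    view = build_dual_view(2048, 1024)
    print(view.global_size, len(view.slices))
```

In this sketch the global view supplies scene-level context at low cost, while the slices keep small text at its native resolution; how the two are aligned through instructions is specific to the paper and is not modeled here.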
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques · Multimodal Machine Learning Applications
