IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation
Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhang, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan

TL;DR
IMTBench is a benchmark of 2,500 image translation samples spanning four practical scenarios and nine languages. It evaluates in-image machine translation on translation quality, background preservation, overall image quality, and cross-modal alignment.
Contribution
The paper introduces IMTBench, a benchmark covering four practical scenarios and nine languages. It targets two gaps in prior work: benchmarks that are largely synthetic, and evaluation protocols limited to single-modality metrics.
Findings
Strong commercial systems show large performance gaps across scenarios.
Natural scenes and resource-limited languages pose significant challenges.
IMTBench highlights substantial headroom for improvement in end-to-end in-image machine translation.
Abstract
End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in…
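The abstract describes the cross-modal alignment score only as a measure of consistency between the model's translated text and the text rendered in its output image. As a rough illustration of that idea, and not the paper's actual metric, the sketch below scores two strings with a normalized edit-distance similarity. The function names `levenshtein` and `alignment_score` are hypothetical, and the sketch assumes the rendered text has already been read back from the image by some OCR step that is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def alignment_score(model_text: str, rendered_text: str) -> float:
    """Hypothetical consistency score in [0, 1] (assumption, not from the paper).

    model_text:    translation emitted by the model as text.
    rendered_text: text recovered from the output image (OCR assumed upstream).
    """
    # Collapse whitespace so line breaks in the rendering are not penalized.
    a = " ".join(model_text.split())
    b = " ".join(rendered_text.split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are trivially consistent
    return 1.0 - levenshtein(a, b) / longest
```

Under this sketch, identical strings score 1.0 and the score falls as the rendered text drifts from the model's textual output. A character-level distance is used here because it needs no tokenizer and behaves the same across scripts; the benchmark itself may define the score differently.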
