MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

TL;DR
This paper introduces MULTITEXTEDIT, a controlled benchmark for evaluating cross-lingual text-in-image editing. It exposes pronounced language-specific failures and proposes a language fidelity (LSF) metric for script-level errors.
Contribution
It presents a new multilingual benchmark with a specialized language fidelity metric to assess cross-lingual performance in text-in-image editing systems.
Findings
Pronounced cross-lingual degradation observed across models.
Largest errors in Hebrew and Arabic, smallest in Dutch and Spanish.
Outputs often preserve layout but distort script-specific text.
Abstract
Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted κ of…
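For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of quadratic-weighted Cohen's kappa between two sets of ordinal ratings. It is not the authors' code. The function name, the integer score scale starting at 0, and the number of score levels are all assumptions made for illustration; the abstract does not state the LSF scale or which raters are compared.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Quadratic-weighted Cohen's kappa for two lists of ordinal scores.

    Scores are assumed to be integers in [0, num_levels - 1]
    (a hypothetical scale, not taken from the paper).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(rater_a)
    k = num_levels

    # Observed confusion matrix: observed[i][j] counts items that
    # rater A scored i and rater B scored j.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: disagreements cost more the further apart
            # the two scores are.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters gave the same single score to every item; the
        # statistic is undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. The quadratic weights mean that a judge who is off by one level is penalized far less than one who is off by several.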
