TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles
Tong Wang, Xiaochao Qu, Ting Liu

TL;DR
TextMastero is a novel multilingual scene text editing framework based on latent diffusion models that significantly improves text accuracy and style preservation, especially for complex scripts like CJK characters.
Contribution
It introduces glyph conditioning and latent guidance modules to enhance text fidelity and style consistency in scene text editing across diverse languages and styles.
Findings
Outperforms existing methods in text fidelity.
Achieves superior style similarity in edited images.
Handles complex scripts like CJK effectively.
Abstract
Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original. Given an image, a target area, and target text, the task produces an output image with the target text in the selected area, replacing the original. This task has been studied extensively, with initial success using Generative Adversarial Networks (GANs) to balance text fidelity and style similarity. However, GAN-based methods struggled with complex backgrounds or text styles. Recent works leverage diffusion models, showing improved results, yet still face challenges, especially with non-Latin languages like CJK characters (Chinese, Japanese, Korean) that have complex glyphs, often producing inaccurate or unrecognizable characters. To address these issues, we present TextMastero, a carefully designed multilingual scene text editing architecture based on…
Taxonomy
Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Diffusion
