Improving Diffusion Models for Scene Text Editing with Dual Encoders
Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, Shiyu Chang

TL;DR
This paper introduces DIFFSTE, a dual encoder diffusion model that significantly improves scene text editing by enhancing text accuracy and style control, with strong zero-shot generalization capabilities demonstrated across multiple datasets.
Contribution
The paper proposes a novel dual encoder diffusion framework with instruction tuning, enabling better text rendering, style control, and zero-shot generalization in scene text editing.
Findings
Outperforms previous methods in text correctness and naturalness
Achieves effective style control and zero-shot font variation generation
Demonstrates superior results on five benchmark datasets
Abstract
Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction…
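The dual encoder idea in the abstract can be illustrated with a toy sketch. This is not the DIFFSTE implementation: the function names, the hash-based stand-in embeddings, the embedding width, the example instruction, and the choice to fuse the two encoders by concatenating their token sequences are all assumptions made for illustration. The only point it shows is that a character-level encoder exposes the spelling of the target text one character at a time, while a separate encoder handles the instruction, and both sequences together could form the conditioning input of a diffusion model.

```python
# Toy sketch of dual-encoder conditioning; NOT the DIFFSTE implementation.
# All names, the hash-based embeddings, and EMBED_DIM are hypothetical.
import hashlib

EMBED_DIM = 8  # assumed toy embedding width


def _toy_embedding(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Deterministic stand-in for a learned embedding: hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def encode_characters(target_text: str) -> list[list[float]]:
    """Character encoder stand-in: one vector per character, so spelling is explicit."""
    return [_toy_embedding("char:" + ch) for ch in target_text]


def encode_instruction(instruction: str) -> list[list[float]]:
    """Instruction encoder stand-in: one vector per whitespace-separated word."""
    return [_toy_embedding("word:" + word) for word in instruction.split()]


def build_condition(instruction: str, target_text: str) -> list[list[float]]:
    """Concatenate both sequences along the token axis (an assumed fusion choice)."""
    return encode_instruction(instruction) + encode_characters(target_text)


# Hypothetical instruction and target text, used only to show the shapes involved.
condition = build_condition('Write "Hello" in italic font', "Hello")
print(len(condition), len(condition[0]))
```

In a real system the stand-in embeddings would be replaced by learned encoders, and the resulting sequence would typically be consumed through cross-attention in the denoising network. How DIFFSTE actually combines the two encoders is not stated in the truncated abstract above.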
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Diffusion
