DreamText: High Fidelity Scene Text Synthesis
Yibin Wang, Weizhong Zhang, Honghui Xu, Cheng Jin

TL;DR
DreamText is a high-fidelity scene text synthesis method that restructures diffusion training around character-level guidance and jointly trains the text encoder and generator, improving the quality of text rendered in images.
Contribution
The paper proposes a hybrid optimization approach that jointly trains the text encoder and the generator to improve character-level accuracy and font diversity in scene text synthesis.
Findings
Outperforms state-of-the-art methods qualitatively.
Achieves higher accuracy in character rendering.
Effectively handles diverse font styles.
Abstract
Scene text synthesis involves rendering specified texts onto arbitrary images. Current methods typically formulate this task in an end-to-end manner but lack effective character-level guidance during training. Besides, their text encoders, pre-trained on a single font type, struggle to adapt to the diverse font styles encountered in practical applications. Consequently, these methods suffer from character distortion, repetition, and absence, particularly in polystylistic scenarios. To this end, this paper proposes DreamText for high-fidelity scene text synthesis. Our key idea is to reconstruct the diffusion training process, introducing more refined guidance tailored to this task, to expose and rectify the model's attention at the character level and strengthen its learning of text regions. This transformation poses a hybrid optimization challenge, involving both discrete and continuous…
Taxonomy
Topics: Natural Language Processing Techniques · Music and Audio Processing · Human Motion and Animation
Methods: Diffusion
