Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su

TL;DR
This paper introduces methods to improve diffusion-based text-to-image models, enabling them to generate legible and accurate visual texts in English and Chinese by enhancing input representation and training strategies.
Contribution
The paper proposes a mixed granularity input strategy and glyph-aware training losses to significantly improve visual text generation in backbone models.
Findings
Enhanced models produce more legible and accurate visual texts
Maintained high-quality general image generation
Effective cross-attention learning for visual texts
Abstract
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with…
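To make the idea of a mixed granularity input concrete, here is a minimal sketch, not the authors' implementation. It assumes (the abstract does not state this) that the text to be rendered is marked with double quotes in the prompt, that it is split into single characters, and that the rest of the prompt stays at word level as a stand-in for BPE tokens. The function name mixed_granularity_split is hypothetical.

```python
import re


def mixed_granularity_split(prompt: str) -> list[str]:
    """Illustrative only: characters inside double quotes, whole words elsewhere.

    Assumption: quoted spans mark the visual text to render. The paper's
    actual input strategy may differ from this toy version.
    """
    units: list[str] = []
    # A capturing group makes re.split keep the quoted spans;
    # they land at the odd indices of the result.
    parts = re.split(r'"([^"]*)"', prompt)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Visual text: one unit per non-space character.
            units.extend(ch for ch in part if not ch.isspace())
        else:
            # Ordinary prompt text: whitespace-separated words
            # (a stand-in for subword tokens in a real tokenizer).
            units.extend(part.split())
    return units


if __name__ == "__main__":
    print(mixed_granularity_split('a sign that says "Hello" on a wall'))
```

The point of the split is that a character-level view of the rendered string exposes its spelling directly, whereas subword tokens can hide individual letters. The same character-level treatment applies to quoted Chinese text, where each character becomes its own unit.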
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Motion and Animation
