Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su

arXiv:2410.04439·cs.CV·October 8, 2024

TL;DR

This paper introduces methods to improve diffusion-based text-to-image models, enabling them to generate legible and accurate visual texts in English and Chinese by enhancing input representation and training strategies.

Contribution

The paper proposes a mixed granularity input strategy and glyph-aware training losses to significantly improve visual text generation in backbone models.

Findings

01. Enhanced models produce more legible and accurate visual texts

02. Maintained high-quality general image generation

03. Effective cross-attention learning for visual texts

Abstract

Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with…
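The abstract names a mixed granularity input strategy but this excerpt does not describe how it works. As a rough illustration only, the sketch below shows one possible reading: the surrounding prompt stays at a coarse granularity while the text to be rendered is split into individual characters, so that subword merging cannot hide its spelling. Everything here is an assumption made for illustration: the function name `mixed_granularity_tokens`, the use of double quotes to mark the visual text, and whitespace splitting as a stand-in for BPE are hypothetical and are not taken from the paper.

```python
import re
from typing import List

# Hypothetical convention: the text to render is wrapped in double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def mixed_granularity_tokens(prompt: str) -> List[str]:
    """Split a prompt into coarse tokens, except quoted spans,
    which are split into single characters.

    Whitespace splitting stands in for a real subword tokenizer here;
    this is an illustrative assumption, not the paper's method.
    """
    tokens: List[str] = []
    pos = 0
    for match in QUOTED.finditer(prompt):
        # Context before the quoted span: coarse, whitespace-split tokens.
        tokens.extend(prompt[pos:match.start()].split())
        # Quoted visual text: one token per non-space character.
        tokens.extend(ch for ch in match.group(1) if not ch.isspace())
        pos = match.end()
    # Remaining context after the last quoted span.
    tokens.extend(prompt[pos:].split())
    return tokens


if __name__ == "__main__":
    print(mixed_granularity_tokens('a sign that says "Hi 你好" on a wall'))
```

Under these assumptions, the English and Chinese characters inside the quotes each become their own token, while the rest of the prompt keeps its coarser units.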

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Motion and Animation