Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan

arXiv:2403.09622 · cs.CV · July 15, 2024 · 1 citation

TL;DR

Glyph-ByT5 is a text encoder customized for visual text rendering in text-to-image models. Integrated with SDXL, it raises text rendering accuracy on the authors' design image benchmark from under 20% to nearly 90% and enables paragraph-level text rendering.

Contribution

The paper introduces Glyph-ByT5, a character-aware, glyph-aligned text encoder, and demonstrates its integration with SDXL to significantly enhance text rendering accuracy in design and real-world images.

Findings

01

Text rendering accuracy on the authors' design image benchmark improved from less than 20% to nearly 90%.

02

Glyph-SDXL can render multi-line paragraphs with high spelling accuracy.

03

Fine-tuning with high-quality images boosts scene text rendering in open images.

Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of…
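
The "character awareness" requirement named in the abstract can be illustrated with a minimal Python sketch of byte-level tokenization, the kind of input representation ByT5 is known for. This is not the authors' code: the function names are hypothetical, and the ID layout (three reserved special IDs, with end-of-sequence at ID 1) is an assumption based on the commonly described ByT5 convention. The point is only that every character maps to its own UTF-8 byte IDs, so spelling is visible to the encoder instead of being hidden inside subword units.

```python
from typing import List

# Assumption: ByT5-style layout where IDs 0, 1, 2 are reserved
# (pad, end-of-sequence, unknown) and byte value b maps to ID b + 3.
NUM_SPECIAL_TOKENS = 3
EOS_ID = 1


def byte_level_ids(text: str) -> List[int]:
    """Encode text as one ID per UTF-8 byte, followed by an end-of-sequence ID."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]


def decode_ids(ids: List[int]) -> str:
    """Invert byte_level_ids, skipping reserved IDs and out-of-range values."""
    raw = bytes(
        i - NUM_SPECIAL_TOKENS
        for i in ids
        if NUM_SPECIAL_TOKENS <= i < 256 + NUM_SPECIAL_TOKENS
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = byte_level_ids("Glyph")
    # One ID per letter plus the end-of-sequence ID.
    print(len(ids), decode_ids(ids))
```

Under this scheme a misspelling changes exactly the IDs of the affected characters, which is the property a glyph-aligned encoder can exploit; a subword tokenizer may instead map a misspelled word to an unrelated sequence of pieces.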

Code & Models

Models

Datasets

GlyphByT5/GlyphByT5Pretraining
dataset · 16 downloads

Taxonomy

Topics: Video Analysis and Summarization · Human Motion and Animation · Handwritten Text Recognition Techniques

Methods: Sparse Evolutionary Training