Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan

TL;DR
Glyph-ByT5 is a specialized text encoder that improves visual text rendering in text-to-image models. Integrated with SDXL, it raises text rendering accuracy from under 20% to nearly 90% on the authors' design image benchmark and enables paragraph-level rendering.
Contribution
The paper introduces Glyph-ByT5, a character-aware, glyph-aligned text encoder, and demonstrates its integration with SDXL to significantly enhance text rendering accuracy in design and real-world images.
Findings
Text rendering accuracy improved from <20% to nearly 90%.
Glyph-SDXL can render multi-line paragraphs with high spelling accuracy.
Fine-tuning with high-quality images boosts scene text rendering in open images.
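To make the accuracy figures above concrete, here is a minimal sketch of one way a word-level spelling score could be computed, assuming the rendered text has already been read back by an OCR system. The function name `word_accuracy` and the case-insensitive, whitespace-split matching rule are assumptions for illustration; the paper's own benchmark metric may be defined differently.

```python
from collections import Counter


def word_accuracy(target: str, recognized: str) -> float:
    """Fraction of target words that also appear in the recognized text.

    Hypothetical metric: words are lowercased and split on whitespace,
    and each recognized word can match at most one target word.
    """
    target_words = Counter(target.lower().split())
    recognized_words = Counter(recognized.lower().split())
    total = sum(target_words.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for missing words, so misspellings contribute nothing.
    matched = sum(min(n, recognized_words[w]) for w, n in target_words.items())
    return matched / total


# One misspelled word out of three lowers the score to two thirds.
score = word_accuracy("Grand Opening Sale", "Grand Openning Sale")
```

A score like this penalizes any single wrong character in a word, which is why long paragraphs are a demanding test of spelling accuracy.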
Abstract
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of…
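The "character awareness" requirement can be illustrated with a small sketch of byte-level encoding in the style of ByT5, where every UTF-8 byte becomes its own token, so each letter of a word is visible to the encoder. The offset of 3 and the end-of-sequence id of 1 follow the ByT5 tokenizer convention as commonly documented and should be treated as assumptions here; the helper names are hypothetical, and this is not the Glyph-ByT5 training procedure itself.

```python
def byte_level_ids(text: str, offset: int = 3, eos_id: int = 1) -> list[int]:
    """Encode text as one id per UTF-8 byte, shifted past the special tokens.

    Assumed convention: ids 0..2 are reserved (pad, eos, unk), so byte b
    maps to id b + offset, and an end-of-sequence id is appended.
    """
    return [b + offset for b in text.encode("utf-8")] + [eos_id]


def decode_byte_level_ids(ids: list[int], offset: int = 3) -> str:
    """Invert byte_level_ids, skipping reserved ids below the offset."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8")


ids = byte_level_ids("Glyph")
# Five letters produce five byte tokens plus the end-of-sequence id.
print(ids)  # [74, 111, 124, 115, 107, 1]
print(decode_byte_level_ids(ids))  # Glyph
```

Because no letters are merged into subword units, a misspelling changes exactly the tokens for the affected characters, which is the property the abstract refers to as character awareness.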
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
