Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

Zeyu Liu; Weicong Liang; Yiming Zhao; Bohan Chen; Lin Liang; Lijuan Wang; Ji Li; Yuhui Yuan

arXiv:2406.10208·cs.CV·July 15, 2024

TL;DR

Glyph-ByT5-v2 and Glyph-SDXL-v2 significantly improve multilingual visual text rendering accuracy and aesthetic quality across 10 languages, surpassing existing models like DALL-E3 and Ideogram 1.0.

Contribution

The paper introduces new multilingual datasets, benchmarks, and a preference learning approach to enhance visual text rendering and aesthetics in graphic design images.

Findings

01 Supports accurate spelling in 10 languages

02 Achieves better aesthetic quality than prior models

03 Outperforms DALL-E3 and Ideogram 1.0 in multilingual rendering

Abstract

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the…
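
As a rough illustration of how a per-language spelling benchmark like the one in contribution (ii) could be scored, the sketch below averages a word-level accuracy over samples grouped by language. The metric definition, the function names `word_accuracy` and `per_language_accuracy`, and the assumption that an OCR step has already produced the recognized text are all hypothetical; the abstract does not specify how the authors compute accuracy.

```python
from collections import defaultdict


def word_accuracy(target: str, recognized: str) -> float:
    """Fraction of target words that also occur in the recognized text.

    Assumed metric for illustration only; each recognized word can
    match at most one target word.
    """
    target_words = target.split()
    if not target_words:
        return 1.0
    remaining = recognized.split()
    hits = 0
    for word in target_words:
        if word in remaining:
            remaining.remove(word)  # consume the match
            hits += 1
    return hits / len(target_words)


def per_language_accuracy(samples):
    """Average word accuracy per language.

    samples: iterable of (language, target_text, recognized_text) tuples,
    where recognized_text is assumed to come from an external OCR step.
    """
    scores = defaultdict(list)
    for lang, target, recognized in samples:
        scores[lang].append(word_accuracy(target, recognized))
    return {lang: sum(vals) / len(vals) for lang, vals in scores.items()}
```

Whitespace tokenization is a simplification: languages written without spaces between words would need a character-level comparison instead.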

Taxonomy

Topics: Human Motion and Animation · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques