JoyType: A Robust Design for Multilingual Visual Text Creation
Chao Li, Chen Jiang, Xiaolong Liu, Jun Zhao, Guoxin Wang

TL;DR
JoyType is a multilingual visual text creation model that preserves font styles, including small fonts, during image generation by using a specialized control network and an OCR-aware loss; it outperforms existing methods in font style accuracy.
Contribution
Introduces JoyType, a multilingual visual text creation approach built on a 1-million-pair training dataset (JoyType-1M) and a font control network, improving font style preservation in diffusion-based image generation.
Findings
Outperforms state-of-the-art methods in font style accuracy
Effectively generates small-font text in images
Can be integrated as a plugin with other diffusion models
Abstract
Generating images with accurately represented text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million pairs of data. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then developed a text…
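As a rough illustration of the training data described in the abstract (an image, its description, and glyph instructions for the font style), one sample could be modeled as below. This is a minimal sketch only: the class names, field names, font names, and the bounding-box layout are assumptions made for illustration and are not taken from the paper or from JoyType-1M itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GlyphInstruction:
    """One text line rendered in the image (hypothetical fields)."""
    text: str                       # string that appears in the image
    font_name: str                  # font the glyphs are rendered in (assumed field)
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels (assumed layout)


@dataclass
class TrainingSample:
    """One training pair: image, description, and glyph instructions."""
    image_path: str
    description: str
    glyphs: List[GlyphInstruction] = field(default_factory=list)

    def smallest_text_height(self) -> int:
        """Pixel height of the shortest text box, a rough proxy for 'small font'."""
        return min(g.box[3] - g.box[1] for g in self.glyphs)


# Hypothetical sample; the file name and fonts are placeholders.
sample = TrainingSample(
    image_path="poster_0001.png",
    description="A cafe poster with a bilingual headline",
    glyphs=[
        GlyphInstruction("Fresh Coffee", "ExampleSans-Bold", (40, 30, 360, 90)),
        GlyphInstruction("新鲜咖啡", "ExampleKai-Regular", (40, 100, 200, 124)),
    ],
)

print(sample.smallest_text_height())
```

Keeping the glyph instructions separate from the description mirrors the split the abstract draws between what the image depicts and how its text should be rendered.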
Taxonomy
Topics: Digital Storytelling and Education
Methods: Diffusion · Hierarchical Information Threading
