ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

Jianyi Zhang; Yufan Zhou; Jiuxiang Gu; Curtis Wigington; Tong Yu; Yiran Chen; Tong Sun; Ruiyi Zhang

arXiv:2406.12044·cs.CV·December 3, 2024

TL;DR

The paper introduces ARTIST, a framework that improves text rendering in diffusion-based image generation by pairing a dedicated textual diffusion model with a visual diffusion model and using large language models to interpret user intent, with reported gains of up to 15% on the MARIO-Eval benchmark.

Contribution

ARTIST presents a new disentangled diffusion architecture with a dedicated textual model and leverages large language models, improving text accuracy and interpretability in text-rich image generation.

Findings

01 Up to 15% improvement on the MARIO-Eval benchmark
02 Enhanced text rendering accuracy in diffusion models
03 Effective integration of large language models for user intent understanding

Abstract

Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model that focuses specifically on learning text structures. Initially, we pretrain this textual model to capture the intricacies of text representation. Subsequently, we finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of diffusion models for text-rich image generation. Additionally, we leverage the…
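The two-stage schedule described in the abstract (pretrain a textual model, then finetune a visual model that reads from it) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `ToyDenoiser`, `add_noise`, and `train_two_stage` are hypothetical names, the "models" are two-weight scalar predictors rather than neural networks, and freezing the textual model during the second stage is a guess. The abstract does not say how the visual model consumes the textual information, so a single scalar conditioning feature stands in for it.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyDenoiser:
    """Hypothetical scalar stand-in for a diffusion denoiser (not the paper's network)."""
    w_x: float = 0.0       # weight on the noisy input
    w_cond: float = 0.0    # weight on the conditioning signal
    frozen: bool = False

    def predict_noise(self, noisy: float, cond: float = 0.0) -> float:
        return self.w_x * noisy + self.w_cond * cond

    def train_step(self, noisy: float, cond: float, noise: float, lr: float = 0.02) -> None:
        """One SGD step on the (halved) squared noise-prediction error."""
        if self.frozen:
            raise RuntimeError("frozen model must not be updated")
        err = self.predict_noise(noisy, cond) - noise
        self.w_x -= lr * err * noisy
        self.w_cond -= lr * err * cond


def add_noise(clean: float, noise: float, alpha: float = 0.8) -> float:
    """Mix a clean value with Gaussian noise, as in a single forward-diffusion step."""
    return alpha * clean + (1.0 - alpha ** 2) ** 0.5 * noise


def train_two_stage(steps: int = 500, seed: int = 0):
    rng = random.Random(seed)
    text_model, visual_model = ToyDenoiser(), ToyDenoiser()

    # Stage 1: pretrain the textual model on text-only samples.
    for _ in range(steps):
        text_clean, noise = rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)
        text_model.train_step(add_noise(text_clean, noise), 0.0, noise)
    text_model.frozen = True  # assumption: the textual model is kept fixed afterwards
    w_after_pretraining = text_model.w_x

    # Stage 2: finetune the visual model, conditioned on the frozen textual model's output.
    for _ in range(steps):
        text_clean, background = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        image_clean = 0.5 * (text_clean + background)  # toy "image" that contains the text
        text_noise, image_noise = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        text_feature = text_model.predict_noise(add_noise(text_clean, text_noise))
        visual_model.train_step(add_noise(image_clean, image_noise), text_feature, image_noise)

    return text_model, visual_model, w_after_pretraining
```

Calling `train_two_stage()` returns both toy models plus the textual weight recorded at the end of stage 1, which makes it easy to confirm that only the visual model changed during stage 2. The point of the sketch is the ordering and the one-way flow of information from the textual model to the visual one, not the numerics.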

Taxonomy

Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection

Methods: Focus · Diffusion