ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models
Jianyi Zhang, Yufan Zhou, Jiuxiang Gu, Curtis Wigington, Tong Yu, Yiran Chen, Tong Sun, Ruiyi Zhang

TL;DR
The paper introduces ARTIST, a novel framework that enhances text rendering in diffusion-based image generation by disentangling text and image models and leveraging large language models to better interpret user intent, resulting in significant quality improvements.
Contribution
ARTIST presents a new disentangled diffusion architecture with a dedicated textual model and leverages large language models, improving text accuracy and interpretability in text-rich image generation.
Findings
Up to 15% improvement on the MARIO-Eval benchmark
Enhanced text rendering accuracy in diffusion models
Effective integration of large language models for user intent understanding
Abstract
Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures. Initially, we pretrain this textual model to capture the intricacies of text representation. Subsequently, we finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of diffusion models for text-rich image generation. Additionally, we leverage the…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection
Methods: Focus · Diffusion
