Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie

TL;DR
This paper thoroughly explores the design space of combining large language models with diffusion transformers for text-to-image synthesis, providing empirical comparisons, analysis, and practical training guidelines.
Contribution
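It offers a detailed empirical study and reproducible training recipes for the deep fusion of LLMs and DiTs in multi-modal generation, addressing two gaps in previous research: the lack of controlled comparisons with alternative methods, and undisclosed design details and training recipes.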
Findings
Controlled comparisons with baseline methods
Analysis of key design choices
Reproducible training recipes
Abstract
This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines…
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Diffusion
