Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang; Boyang Zheng; Xichen Pan; Sayak Paul; Saining Xie

arXiv:2505.10046 · cs.CV · May 16, 2025

TL;DR

This paper thoroughly explores the design space of combining large language models with diffusion transformers for text-to-image synthesis, providing empirical comparisons, analysis, and practical training guidelines.

Contribution

It offers a detailed empirical study and reproducible training recipes for deep fusion of LLMs and DiTs in multi-modal generation, addressing gaps in previous research.

Findings

1. Controlled comparisons with baseline methods
2. Analysis of key design choices
3. Reproducible training recipes

Abstract

This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines…

Code & Models

Repositories

tang-bd/fuse-dit (PyTorch, official)

Models

🤗 ooutlierr/fuse-dit (model, 8 downloads, 7 likes)

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Diffusion