ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou

TL;DR
ChatDiT is a training-free framework that uses pretrained diffusion transformers, in their original form, to carry out interactive visual generation tasks through natural language. It is reported to outperform specialized models on IDEA-Bench without any additional training.
Contribution
It introduces ChatDiT, a zero-shot, multi-agent system leveraging pretrained diffusion transformers for diverse visual generation tasks without any fine-tuning or modifications.
Findings
Outperforms specialized models on IDEA-Bench tasks
Operates effectively without additional training or tuning
Supports interactive, multi-round natural language visual tasks
Abstract
Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural…
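As a rough illustration of what "concatenating self-attention tokens across multiple input and target images" could look like, the sketch below runs a single-head self-attention over the joined token sequences of two images and then splits the result back per image. It is a minimal NumPy toy, not the authors' implementation: the function name `joint_self_attention`, the token counts, the feature width, and the random projection matrices are all assumptions made for the example, and the grouped and masked generation pipelines are not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(image_tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens concatenated from several images.

    image_tokens: list of arrays, each of shape (n_i, d) (hypothetical layout).
    Returns the per-image outputs and the full attention matrix.
    """
    # Join all images into one sequence so every token can attend to every other.
    tokens = np.concatenate(image_tokens, axis=0)          # (sum n_i, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                         # rows sum to 1
    out = attn @ v
    # Split the joint output back into one block per image.
    sizes = [t.shape[0] for t in image_tokens]
    boundaries = np.cumsum(sizes)[:-1]
    return np.split(out, boundaries, axis=0), attn


rng = np.random.default_rng(0)
d = 8                                    # assumed feature width
reference = rng.normal(size=(4, d))      # tokens of an input image (toy)
target = rng.normal(size=(6, d))         # tokens of a target image (toy)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

outputs, attn = joint_self_attention([reference, target], w_q, w_k, w_v)
```

In this toy setup the attention matrix covers all 10 tokens at once, so target tokens can draw on reference tokens (and vice versa) without any change to the attention layer itself, which is the intuition behind using the pretrained model unmodified.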
Taxonomy
Topics: AI in Service Interactions · Topic Modeling · Speech and dialogue systems
Methods: Diffusion
