ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Lianghua Huang; Wei Wang; Zhi-Fan Wu; Yupeng Shi; Chen Liang; Tong Shen; Han Zhang; Huanzhang Dou; Yu Liu; Jingren Zhou

arXiv:2412.12571·cs.CV·December 18, 2024

TL;DR

ChatDiT is a training-free framework that uses pretrained diffusion transformers, unmodified, for interactive visual generation driven by natural language. It outperforms specialized models on IDEA-Bench without additional training.

Contribution

The paper introduces ChatDiT, a zero-shot, multi-agent system that applies pretrained diffusion transformers to diverse visual generation tasks without fine-tuning or architectural modifications.

Findings

01 Outperforms specialized models on IDEA-Bench tasks

02 Operates effectively without additional training or tuning

03 Supports interactive, multi-round natural language visual tasks

Abstract

Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural…
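
The abstract's description of concatenating tokens across input and target images, combined with masked generation, can be illustrated with a toy sketch. This is not ChatDiT's code: the names `ImageTokens`, `concat_panels`, `joint_self_attention`, and `masked_update` are hypothetical, the attention uses identity projections and a single head, and a real DiT would add learned projections, text tokens, and a denoising schedule. The sketch only shows the idea that reference and target panels share one attention sequence while a mask restricts updates to the target tokens.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageTokens:
    """Patch tokens for one image panel (hypothetical container)."""
    tokens: List[List[float]]  # one feature vector per patch
    is_target: bool            # True if this panel is to be generated


def concat_panels(panels: List[ImageTokens]) -> Tuple[List[List[float]], List[bool]]:
    """Join all panels into one token sequence plus a per-token generation mask."""
    sequence: List[List[float]] = []
    generate_mask: List[bool] = []
    for panel in panels:
        sequence.extend(panel.tokens)
        generate_mask.extend([panel.is_target] * len(panel.tokens))
    return sequence, generate_mask


def joint_self_attention(sequence: List[List[float]]) -> List[List[float]]:
    """Single-head softmax attention over the whole joint sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), purely to keep the sketch short.
    """
    dim = len(sequence[0])
    output = []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([
            sum(w * value[i] for w, value in zip(weights, sequence)) / total
            for i in range(dim)
        ])
    return output


def masked_update(sequence, proposal, generate_mask):
    """Keep reference tokens fixed; accept proposed values only where the mask is True."""
    return [new if gen else old
            for old, new, gen in zip(sequence, proposal, generate_mask)]


# One reference panel (kept fixed) and one target panel (to be filled in).
reference = ImageTokens(tokens=[[1.0, 0.0], [0.0, 1.0]], is_target=False)
target = ImageTokens(tokens=[[0.0, 0.0], [0.0, 0.0]], is_target=True)

sequence, generate_mask = concat_panels([reference, target])
proposal = joint_self_attention(sequence)
updated = masked_update(sequence, proposal, generate_mask)
```

In this toy setup the target tokens attend to the reference tokens through the shared sequence, while the mask leaves the reference panel untouched. That is the in-context behaviour the abstract attributes to concatenated attention with masked generation, shown here only in miniature.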

Code & Models

Repositories

ali-vilab/chatdit
PyTorch · Official

Taxonomy

Topics: AI in Service Interactions · Topic Modeling · Speech and dialogue systems

Methods: Diffusion