Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising
Gongfan Fang, Xinyin Ma, Xinchao Wang

TL;DR
Remix-DiT builds N timestep-specific denoising experts by mixing K basis models (K < N) with learnable coefficients, improving image generation quality without the cost of training N independent expert models.
Contribution
The paper proposes mixing a small set of basis models with learnable coefficients to form experts for different denoising timesteps, improving diffusion model quality without training multiple independent experts.
Findings
Achieves improved image quality on ImageNet.
Maintains efficiency comparable to standard diffusion transformers.
Outperforms other multi-expert methods in experiments.
Abstract
Transformer-based diffusion models have achieved significant advancements across a variety of generative tasks. However, producing high-quality outputs typically necessitates large transformer models, which result in substantial training and inference overhead. In this work, we investigate an alternative approach involving multiple experts for denoising, and introduce Remix-DiT, a novel method designed to enhance output quality at a low cost. The goal of Remix-DiT is to craft N diffusion experts for different denoising timesteps, yet without the need for expensive training of N independent models. To achieve this, Remix-DiT employs K basis models (where K < N) and utilizes learnable mixing coefficients to adaptively craft expert models. This design offers two significant advantages: first, although the total model size is increased, the model produced by the mixing operation shares the…
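The mixing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the softmax normalization of the coefficients, the equal-width split of timesteps into N intervals, the values K = 4, N = 20, T = 1000, and the helper names `timestep_to_expert` and `mix_expert_weights` are all assumptions made for illustration. A single weight matrix stands in for a full transformer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def timestep_to_expert(t, num_timesteps, num_experts):
    """Map timestep t in [0, num_timesteps) to one of N equal-width intervals.

    Equal-width intervals are an assumption of this sketch.
    """
    return min(t * num_experts // num_timesteps, num_experts - 1)


def mix_expert_weights(basis_weights, mixing_logits, expert_idx):
    """Form one expert's weights as a weighted sum of the K basis weights.

    basis_weights: array of shape (K, d_in, d_out)
    mixing_logits: array of shape (N, K), the learnable parameters
    """
    coeffs = softmax(mixing_logits[expert_idx])  # shape (K,), sums to 1
    # Contract the K axis: the result has shape (d_in, d_out),
    # the same size as a single basis model's weight.
    return np.tensordot(coeffs, basis_weights, axes=1)


# Illustrative sizes only (hypothetical, not taken from the source).
K, N, T = 4, 20, 1000
d_in, d_out = 8, 8

rng = np.random.default_rng(0)
basis_weights = rng.normal(size=(K, d_in, d_out))
mixing_logits = rng.normal(size=(N, K))

expert_idx = timestep_to_expert(730, T, N)
W = mix_expert_weights(basis_weights, mixing_logits, expert_idx)

x = rng.normal(size=(2, d_in))  # a toy batch of two feature vectors
y = x @ W                       # forward pass through the mixed layer
```

In this sketch the mixed weight `W` has the shape of one basis weight, so the forward pass costs the same as a single model even though K sets of parameters are stored.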
Taxonomy
Topics: Image and Signal Denoising Methods
Methods: Diffusion
