Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
Yuan Yao, Yicong Hong, Difan Liu, Long Mai, Feng Liu, Jiebo Luo

TL;DR
This paper introduces a distillation method to efficiently train high-resolution diffusion models by transitioning from transformer-based to linear-complexity Mamba models, enabling high-quality image generation with reduced computational costs.
Contribution
The paper proposes diffusion transformer-to-Mamba distillation (T2MD), a training pipeline that eases the transition from self-attention transformers to Mamba models for high-resolution image synthesis.
Findings
Efficient training of high-resolution diffusion models via T2MD.
High-quality 2048×2048 image generation with low overhead.
Feasibility of using Mamba models for non-causal visual output.
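To give a sense of why resolution matters here, the following rough sketch counts tokens and pairwise attention interactions at a few image sizes. The 8x latent downsampling factor and the patch size of 2 are assumptions chosen for illustration, not values taken from the paper, and the function names are hypothetical.

```python
def token_count(image_size, vae_downsample=8, patch_size=2):
    """Number of tokens for a square image.

    vae_downsample and patch_size are illustrative assumptions,
    not settings reported by the paper.
    """
    side = image_size // (vae_downsample * patch_size)
    return side * side


def attention_pair_count(n_tokens):
    """Self-attention compares every token with every token: N * N pairs."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for size in (512, 1024, 2048):
        n = token_count(size)
        # A linear-complexity model scales with n; self-attention with n * n.
        print(size, n, attention_pair_count(n))
```

Under these assumptions, going from 512×512 to 2048×2048 multiplies the token count by 16 but the number of attention pairs by 256, which is the gap a linear-complexity model is meant to close.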
Abstract
The quadratic computational complexity of self-attention in diffusion transformers (DiT) introduces substantial computational costs in high-resolution image generation. While the linear-complexity Mamba model emerges as a potential alternative, direct Mamba training remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-Mamba distillation (T2MD), forming an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear-complexity state-space model Mamba. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the difficulty and high cost of training a state-space model from scratch. Starting from the distilled…
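The abstract excerpt does not spell out how layer-level teacher forcing and feature-based distillation are combined, so the following is only a minimal sketch of one plausible reading: each student layer receives the teacher's input to the corresponding layer, and the loss is the mean squared error between the two layers' outputs. The function names, the toy layers, and the use of plain MSE are all assumptions for illustration, not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_forced_feature_loss(teacher_layers, student_layers, x):
    """Average per-layer feature-matching loss (hypothetical formulation).

    Each layer is a callable mapping a list of floats to a list of floats.
    The student layer is fed the teacher's hidden state, so errors from
    earlier student layers do not accumulate.
    """
    total = 0.0
    hidden = x
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        t_out = t_layer(hidden)
        s_out = s_layer(hidden)  # student sees the teacher's input
        total += mse(s_out, t_out)
        hidden = t_out  # the next layer's input comes from the teacher
    return total / len(teacher_layers)


if __name__ == "__main__":
    # Toy stand-ins for attention (teacher) and Mamba (student) layers.
    teacher = [lambda h: [2.0 * v for v in h], lambda h: [v + 1.0 for v in h]]
    student = [lambda h: [2.0 * v for v in h], lambda h: [v + 0.5 for v in h]]
    print(teacher_forced_feature_loss(teacher, student, [1.0, 2.0]))
```

In this toy case the first student layer matches its teacher exactly and only the second contributes to the loss, which shows how per-layer supervision localizes the mismatch to a single layer.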
