Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

Yuan Yao; Yicong Hong; Difan Liu; Long Mai; Feng Liu; Jiebo Luo

arXiv:2506.18999·cs.CV·June 25, 2025

TL;DR

This paper introduces a distillation method to efficiently train high-resolution diffusion models by transitioning from transformer-based to linear-complexity Mamba models, enabling high-quality image generation with reduced computational costs.

Contribution

The paper proposes diffusion transformer-to-mamba distillation (T2MD), a novel training pipeline that facilitates the transition from self-attention transformers to Mamba models for high-resolution image synthesis.

Findings

1. Efficient training of high-resolution diffusion models via T2MD.

2. High-quality 2048×2048 image generation with low overhead.

3. Feasibility of using Mamba models for non-causal visual output.
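
To make the cost argument behind these findings concrete, the following is a rough back-of-the-envelope sketch of how sequence length grows with resolution. The VAE downsampling factor of 8 and patch size of 2 are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def token_count(height, width, vae_factor=8, patch=2):
    """Number of tokens after latent downsampling and patchification.

    vae_factor and patch are assumed values chosen for illustration only.
    """
    side_h = height // (vae_factor * patch)
    side_w = width // (vae_factor * patch)
    return side_h * side_w


def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: O(N^2)."""
    return n_tokens * n_tokens


def scan_steps(n_tokens):
    """A linear-complexity sequence model visits each token once: O(N)."""
    return n_tokens


if __name__ == "__main__":
    base = token_count(512, 512)
    for res in (512, 1024, 2048):
        n = token_count(res, res)
        print(
            f"{res}x{res}: tokens={n}, "
            f"attention growth={attention_pairs(n) // attention_pairs(base)}x, "
            f"linear growth={scan_steps(n) // scan_steps(base)}x"
        )
```

Under these assumptions, going from 512×512 to 2048×2048 multiplies the token count by 16, so the pairwise attention work grows by 256 times while a linear scan grows by only 16 times.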

Abstract

The quadratic computational complexity of self-attention in diffusion transformers (DiT) makes high-resolution image generation computationally expensive. While the linear-complexity Mamba model has emerged as a potential alternative, training Mamba directly remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-mamba distillation (T2MD), an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear-complexity state-space model Mamba. We establish a hybrid diffusion model that combines self-attention and Mamba to achieve both efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the difficulty and high cost of training a state-space model from scratch. Starting from the distilled…
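
The abstract does not spell out how layer-level teacher forcing and feature-based distillation are combined. One plausible reading, sketched below in plain Python, is that each student layer receives the teacher's hidden state as input and is trained to match the teacher's output feature for that layer. The function names, the mean-squared-error loss, and the toy layers are all assumptions made for illustration, not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_forced_distill_loss(teacher_layers, student_layers, x):
    """Sum of per-layer feature losses under layer-level teacher forcing.

    Assumption: every student layer is fed the teacher's hidden state, so an
    error in one student layer does not contaminate the inputs of later ones.
    """
    total = 0.0
    hidden = x
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        target = t_layer(hidden)   # teacher feature for this layer
        pred = s_layer(hidden)     # student layer sees the teacher's input
        total += mse(pred, target)
        hidden = target            # teacher forcing: propagate teacher output
    return total


# Toy stand-ins for network layers (hypothetical, for demonstration only).
teacher = [lambda v: [2 * x for x in v], lambda v: [x + 1 for x in v]]
student_same = [lambda v: [2 * x for x in v], lambda v: [x + 1 for x in v]]
student_off = [lambda v: [2 * x + 1 for x in v], lambda v: [x + 1 for x in v]]

loss_same = teacher_forced_distill_loss(teacher, student_same, [1.0, 2.0])
loss_off = teacher_forced_distill_loss(teacher, student_off, [1.0, 2.0])
```

In this toy setup, the mismatched first layer contributes to the loss while the second layer contributes nothing, because it is evaluated on the teacher's hidden state rather than on the student's own erroneous output.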
