Dimba: Transformer-Mamba Diffusion Models
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang

TL;DR
Dimba introduces a hybrid Transformer-Mamba diffusion model for text-to-image generation, achieving comparable quality with improved efficiency and flexibility, and revealing promising architectural properties for large-scale diffusion models.
Contribution
This work presents the first hybrid Transformer-Mamba architecture for diffusion models, combining their strengths and optimizing for large-scale, resource-constrained image generation.
Findings
Dimba achieves comparable image quality to benchmarks.
When scaled appropriately, the model offers higher throughput and lower memory usage than pure Transformer-based baselines.
It demonstrates flexible adaptation to resource constraints.
Abstract
This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through the cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves comparable…
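The alternating layout described in the abstract can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the function name `build_layer_schedule`, the `mamba_per_attention` parameter, and the ordering of layers within a block (self-attention, then cross-attention for the text condition, then Mamba) are assumptions made purely for illustration. The sketch only lays out which kind of layer sits where; it does not implement the layers themselves.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerSpec:
    """Describes one layer slot in the stack (hypothetical helper type)."""
    block: int  # index of the block this layer belongs to
    kind: str   # "self_attention", "cross_attention", or "mamba"


def build_layer_schedule(num_blocks: int, mamba_per_attention: int = 1) -> List[LayerSpec]:
    """Lay out a hybrid stack that alternates Transformer and Mamba layers.

    Assumed block layout (not confirmed by the source):
    self-attention -> cross-attention (text conditioning) -> N Mamba layers.
    """
    if num_blocks < 1 or mamba_per_attention < 1:
        raise ValueError("num_blocks and mamba_per_attention must be >= 1")

    schedule: List[LayerSpec] = []
    for block in range(num_blocks):
        schedule.append(LayerSpec(block, "self_attention"))
        schedule.append(LayerSpec(block, "cross_attention"))
        for _ in range(mamba_per_attention):
            schedule.append(LayerSpec(block, "mamba"))
    return schedule


def count_layer_kinds(schedule: List[LayerSpec]) -> Counter:
    """Count how many layers of each kind the schedule contains."""
    return Counter(layer.kind for layer in schedule)


if __name__ == "__main__":
    # Example: 4 blocks, two Mamba layers per attention layer.
    for layer in build_layer_schedule(num_blocks=4, mamba_per_attention=2):
        print(layer.block, layer.kind)
```

Raising the hypothetical `mamba_per_attention` value shifts the stack toward Mamba layers, which is one plausible way to picture the trade-off against resource constraints that the abstract mentions.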
Taxonomy
Topics: Vibration and Dynamic Analysis
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Residual Connection · Position-Wise Feed-Forward Layer
