Dimba: Transformer-Mamba Diffusion Models

Zhengcong Fei; Mingyuan Fan; Changqian Yu; Debang Li; Youqiang Zhang; Junshi Huang

arXiv:2406.01159 · cs.CV · June 4, 2024 · 6 cites

TL;DR

Dimba introduces a hybrid Transformer-Mamba diffusion model for text-to-image generation, achieving comparable quality with improved efficiency and flexibility, and revealing promising architectural properties for large-scale diffusion models.

Contribution

This work presents the first hybrid Transformer-Mamba architecture for diffusion models, combining the strengths of both layer types and targeting large-scale image generation under resource constraints.

Findings

01

Dimba achieves image quality comparable to existing benchmarks.

02

When scaled appropriately, the model offers higher throughput and lower memory usage than pure Transformer-based benchmarks.

03

It demonstrates flexible adaptation to resource constraints.

Abstract

This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through a cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves comparable…
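The alternating layout described in the abstract can be illustrated with a small layer-schedule sketch. This is not the authors' implementation: the function names, the `mamba_per_attention` parameter, and the ordering of self-attention, cross-attention, and Mamba layers inside a block are all assumptions made for illustration. The sketch only shows how the ratio of attention to Mamba layers could be exposed as a single knob for trading quality against throughput and memory.

```python
from typing import List


def build_layer_schedule(num_blocks: int, mamba_per_attention: int = 1) -> List[str]:
    """Return the layer order for a hypothetical hybrid stack.

    Each block holds one self-attention layer, one cross-attention layer
    (where text conditioning would enter), and `mamba_per_attention`
    Mamba layers. The within-block ordering is an assumption.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if mamba_per_attention < 0:
        raise ValueError("mamba_per_attention must be non-negative")

    schedule: List[str] = []
    for _ in range(num_blocks):
        schedule.append("self_attention")
        schedule.append("cross_attention")
        schedule.extend(["mamba"] * mamba_per_attention)
    return schedule


def attention_fraction(schedule: List[str]) -> float:
    """Fraction of sequence-mixing layers (self-attention or Mamba) that are self-attention."""
    attention = sum(1 for name in schedule if name == "self_attention")
    mixers = sum(1 for name in schedule if name in ("self_attention", "mamba"))
    return attention / mixers


if __name__ == "__main__":
    # One Mamba layer per attention layer: strict alternation.
    print(build_layer_schedule(num_blocks=2, mamba_per_attention=1))
    # More Mamba layers per block lowers the share of attention layers.
    print(attention_fraction(build_layer_schedule(num_blocks=3, mamba_per_attention=3)))
```

Raising `mamba_per_attention` in this sketch reduces the share of self-attention layers, which is one plausible way a hybrid design could be adapted to tighter memory or throughput budgets.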

Taxonomy

Topics: Vibration and Dynamic Analysis

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Residual Connection · Position-Wise Feed-Forward Layer