DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation

Hao Phung; Quan Dao; Trung Dao; Hoang Phan; Dimitris Metaxas; Anh Tran

arXiv:2411.04168 · cs.CV · April 14, 2025

TL;DR

DiMSUM introduces a unified spatial-frequency diffusion model that leverages wavelet transformations and a globally-shared transformer to improve image generation quality and training efficiency.

Contribution

The paper proposes a novel state-space architecture that integrates wavelet transforms and a globally-shared transformer to improve the capture of both local and global features in image generation.

Findings

01

Outperforms DiT and DIFFUSSM on standard benchmarks.

02

Achieves faster training convergence.

03

Produces higher-quality images.

Abstract

We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention…
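The subband split mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single-level 2D Haar wavelet, which the abstract does not specify, and the function name `haar_dwt2` is hypothetical; this is not the authors' implementation, only an illustration of how one image separates into a low-frequency subband and three high-frequency subbands.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an even-sized grid (a list of rows).

    Assumption: Haar is used purely for illustration; the paper's wavelet
    family is not stated in the abstract. Returns a dict of four subbands,
    each with half the height and half the width of the input.
    Note that LH/HL naming conventions differ between libraries.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, height, 2):
        rows = {name: [] for name in bands}
        for c in range(0, width, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows["LL"].append((a + b + d + e) / 2)  # local average (low frequency)
            rows["LH"].append((a - b + d - e) / 2)  # difference across columns
            rows["HL"].append((a + b - d - e) / 2)  # difference across rows
            rows["HH"].append((a - b - d + e) / 2)  # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands


if __name__ == "__main__":
    # A 2x2 patch with a vertical edge between its two columns.
    subbands = haar_dwt2([[0, 4], [0, 4]])
    for name, values in subbands.items():
        print(name, values)
```

In this sketch a flat image yields all-zero detail subbands, while an edge shows up in the subband that differences across it. That is the sense in which the decomposition separates low- and high-frequency content before any sequence model processes it.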

Code & Models

Repositories

vinairesearch/dimsum
PyTorch · Official

Models

🤗 haopt/dimsum-L2-imagenet256
model · ♡ 2

Videos

DiMSUM: Diffusion Mamba - A Scalable and Unified Spatial-Frequency Method for Image Generation · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Diffusion · Mamba: Linear-Time Sequence Modeling with Selective State Spaces