DiC: Rethinking Conv3x3 Designs in Diffusion Models

Yuchuan Tian; Jing Han; Chengcheng Wang; Yuchen Liang; Chao Xu; Hanting Chen

arXiv:2501.00603·cs.CV·June 10, 2025

TL;DR

This paper introduces DiC, a purely convolutional diffusion model that rethinks Conv3x3 design and adds architectural and conditioning improvements. It is reported to run faster than transformer-based diffusion models while outperforming them in generation quality.

Contribution

The paper proposes DiC, a purely convolutional diffusion architecture built from Conv3x3 blocks in an encoder-decoder hourglass layout, with sparse skip connections and improved conditioning. The authors report that it surpasses transformer-based diffusion models in both performance and speed.

Findings

01. DiC outperforms diffusion transformers in quality.

02. DiC maintains faster inference speeds.

03. Architectural improvements significantly boost diffusion performance.

Abstract

Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-shaped CNN-attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on the complicated self-attention operation results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up, purely convolutional diffusion model. We first discover that an encoder-decoder hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still underperforms our expectations. To further improve the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning…
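The two building blocks named in the abstract, a 3x3 convolution and skip connections between encoder and decoder, can be illustrated with a minimal, dependency-free sketch. The function names `conv3x3` and `skip_pairs` are hypothetical, and the rule used for sparse skips (keep only the skip at the last block of each stage) is an assumption made for illustration; the abstract does not specify where DiC places its skips, and this is not the authors' implementation.

```python
# Illustrative sketch only. Function names and the sparse-skip placement rule
# are assumptions, not taken from the DiC paper or its repository.

def conv3x3(image, kernel):
    """Single-channel 3x3 convolution (cross-correlation form, as used in
    deep learning) with zero padding and stride 1."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    # Out-of-bounds neighbours count as zeros (zero padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def skip_pairs(num_stages, blocks_per_stage, sparse):
    """Return (encoder_block, decoder_block) index pairs joined by a skip.

    Dense: every encoder block feeds its mirrored decoder block.
    Sparse (assumed rule): only the last block of each encoder stage does.
    """
    total = num_stages * blocks_per_stage
    pairs = []
    for enc in range(total):
        is_stage_end = (enc + 1) % blocks_per_stage == 0
        if sparse and not is_stage_end:
            continue
        pairs.append((enc, total - 1 - enc))  # mirrored decoder index
    return pairs


if __name__ == "__main__":
    print(len(skip_pairs(3, 2, sparse=False)), "dense skips")
    print(len(skip_pairs(3, 2, sparse=True)), "sparse skips")
```

Under this assumed rule, an hourglass with 3 stages of 2 blocks each keeps 3 skips instead of 6, which shows the kind of redundancy reduction the abstract refers to.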

Code & Models

Repositories

yuchuantian/dic (PyTorch, official)

Taxonomy

Topics: VLSI and FPGA Design Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion · Convolution