DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

TL;DR
This paper introduces DiCo, a convolution-based diffusion model that replaces self-attention with efficient convolutional modules, achieving comparable or better performance with significantly reduced computational costs.
Contribution
The paper proposes a novel convolutional diffusion model, DiCo, with a channel attention mechanism to enhance feature diversity, replacing costly self-attention in diffusion transformers.
Findings
DiCo-XL achieves an FID of 2.05 on ImageNet at 256x256.
DiCo models are 2.7x to 3.1x faster than DiT-XL/2.
Purely convolutional DiCo performs well on text-to-image tasks.
Abstract
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to…
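The abstract does not spell out how the compact channel attention is computed. As a rough illustration only, the sketch below shows a generic channel gate: global average pooling per channel, a pointwise linear map across channels, a sigmoid, and a per-channel rescaling of the feature map. The function name channel_attention and the parameters weight and bias are hypothetical, and this is an assumed squeeze-and-gate design rather than the authors' confirmed module. It uses plain Python lists so it runs without any deep learning framework.

```python
import math


def channel_attention(x, weight, bias):
    """Generic channel gate (illustrative assumption, not the paper's exact module).

    x:      list of C channels, each an H x W list of lists of floats
    weight: C x C list of lists (pointwise linear map across channels)
    bias:   list of C floats
    Returns (gated feature map, per-channel gates).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in x
    ]

    # Gate: linear mix of the pooled vector, then a sigmoid into (0, 1).
    gates = []
    for w_row, b in zip(weight, bias):
        z = sum(w * p for w, p in zip(w_row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))

    # Scale: multiply every spatial position of a channel by its gate.
    out = [
        [[g * value for value in row] for row in channel]
        for channel, g in zip(x, gates)
    ]
    return out, gates


# Hypothetical toy input: 2 channels, each 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
gated, gates = channel_attention(features, zero_weight, zero_bias)
```

With zero weights and biases every gate is sigmoid(0) = 0.5, so each channel is simply halved. With learned weights, the gates differ per channel, and a gate of this kind is one way a network could rebalance which channels are active.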
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion · Byte Pair Encoding · Label Smoothing · Adam · Softmax
