Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling
Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou

TL;DR
This paper introduces LEGO bricks, a flexible and efficient backbone for diffusion models that enables variable-resolution image generation, reduces computational costs, and improves training and sampling efficiency.
Contribution
The paper proposes LEGO bricks, a novel reconfigurable diffusion backbone that stacks local and global modules for adaptable, resource-efficient image synthesis.
Findings
LEGO bricks improve training efficiency and convergence speed.
They enable variable-resolution image generation.
Sampling time is significantly reduced compared to existing methods.
Abstract
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with…
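The stacking and skipping described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: `ToyBrick`, `run_stack`, and the patch-mean update are hypothetical stand-ins, chosen only to show that (a) each brick works on local patches of its own size, (b) a brick written as a residual update can be skipped at test time without breaking the stack, and (c) the same bricks apply to any image whose sides are divisible by the patch size.

```python
from typing import FrozenSet, List, Sequence

Image = List[List[float]]


def split_into_patches(img: Image, p: int) -> List[Image]:
    """Split an H x W image into non-overlapping p x p patches (row-major)."""
    h, w = len(img), len(img[0])
    if h % p or w % p:
        raise ValueError("image sides must be divisible by the patch size")
    return [
        [row[c:c + p] for row in img[r:r + p]]
        for r in range(0, h, p)
        for c in range(0, w, p)
    ]


def merge_patches(patches: Sequence[Image], h: int, w: int, p: int) -> Image:
    """Inverse of split_into_patches: reassemble patches into an H x W image."""
    out = [[0.0] * w for _ in range(h)]
    per_row = w // p
    for idx, patch in enumerate(patches):
        r0, c0 = (idx // per_row) * p, (idx % per_row) * p
        for dr in range(p):
            for dc in range(p):
                out[r0 + dr][c0 + dc] = patch[dr][dc]
    return out


class ToyBrick:
    """Hypothetical stand-in for a brick (assumption, not the paper's module).

    Each pixel is pulled toward its patch mean: v + scale * (mean - v).
    Because this is a residual update, skipping the brick is the identity.
    """

    def __init__(self, patch_size: int, scale: float) -> None:
        self.patch_size = patch_size
        self.scale = scale

    def __call__(self, img: Image) -> Image:
        h, w, p = len(img), len(img[0]), self.patch_size
        updated = []
        for patch in split_into_patches(img, p):
            mean = sum(sum(row) for row in patch) / (p * p)
            updated.append(
                [[v + self.scale * (mean - v) for v in row] for row in patch]
            )
        return merge_patches(updated, h, w, p)


def run_stack(
    bricks: Sequence[ToyBrick],
    img: Image,
    skip: FrozenSet[int] = frozenset(),
) -> Image:
    """Apply bricks in order, skipping those whose index is in `skip`."""
    for i, brick in enumerate(bricks):
        if i not in skip:
            img = brick(img)
    return img


# Two bricks with different patch sizes: a fine one (2x2) and a coarse one (4x4).
bricks = [ToyBrick(patch_size=2, scale=0.5), ToyBrick(patch_size=4, scale=0.5)]
small = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
large = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

full = run_stack(bricks, small)                       # all bricks
cheap = run_stack(bricks, small, skip=frozenset({0}))  # skip the fine brick
bigger = run_stack(bricks, large)                     # same bricks, larger image
```

In this toy, skipping a brick trades refinement for compute, and the unchanged stack runs on the 8x8 input because every brick only needs the image sides to be divisible by its patch size. How the actual LEGO bricks combine local and global information is described in the paper itself.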
Peer Reviews
Decision: ICLR 2024 poster
The paper is well written and generally easy to follow. To my knowledge, the idea of splitting the image into patches and processing them with a hierarchy of modules has not been explored in the diffusion literature before. The skipping and mixing of modules envisioned by the authors is interesting. The method seems to train more efficiently than recent diffusion methods and has lower inference FLOPs than those at the same sample quality.
It seems the authors in essence propose a block-based hierarchical architecture which is not very different from a UNet. While there is lots of talk on modularity and skipping of blocks, these aspects are only explored in the appendix on 64x64 CelebA images, i.e. an easy data set at a resolution where inference speed ups are not very interesting. The aspect of incorporating a pretrained diffusion model is only explored as an ablation. Further, generating images larger than the training resolutio
- The proposed method appears to be original and is intuitive. - The authors present sensible experiments that ablate over several design choices of their method. Especially the choice of dropping specific LEGO blocks during inference is quite interesting. - The authors clearly state several limitations of their work, none of which are a major concern for this submission -- the paper is well-scoped. - The writing style is good, and the authors always try to simplify and add intuition to desig
- The method presentation could be improved. Intro, Section 3.1 and Figure 3 provide some high-level intuition which is helpful for the start. Meanwhile, the remaining sections obfuscate major questions, e.g. whether a LEGO brick is applied densely or sparsely or how the patches are selected. - The majority of the experiments and results are placed in the appendix, while a very large portion of the main paper is dedicated to an extensive introduction, related works, and more context setting at t
* The idea of designing a run-time configurable backbone for diffusion models is interesting and timely. * The design within LEGO bricks makes sense, and the performance also looks good. * The paper is well-written with clear structure. The training and sampling details are presented in a clear way.
* It seems that the design of LEGO bricks borrows a lot from DiT, so it is a bit unclear to me how much additional contribution w.r.t. network design is made in this work. * It is also unclear whether the idea of LEGO (skippable and stackable backbone) is specific to DiT, or whether it is general enough to be applied to other types of backbone for diffusion models.
Taxonomy
Topics: Advanced Neuroimaging Techniques and Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Layer · Label Smoothing · Concatenated Skip Connection · Adam · Absolute Position Encodings
