Ensembling Diffusion Models via Adaptive Feature Aggregation
Cong Wang, Kuan Tian, Yonghang Guan, Fei Shen, Zhiwei Jiang, Qing Gu, Jun Zhang

TL;DR
This paper introduces Adaptive Feature Aggregation (AFA), a dynamic ensembling method for diffusion models that adjusts each model's contribution at the feature level according to the current state (prompt, initial noise, denoising step, and spatial location), improving generation quality.
Contribution
The paper proposes a novel, lightweight, trainable feature aggregator that dynamically combines multiple diffusion models' features according to various states, enhancing their collective performance.
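To make state-dependent feature aggregation concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: it only shows same-shaped feature maps from several models being blended with weights that differ per spatial location. The function names, the softmax normalization, and the `scores` input are assumptions for illustration; in a trained system, a learned aggregator would predict such scores from the current state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_features(features, scores):
    """Blend per-model feature maps with per-location weights.

    features: list over models of H x W grids of feature values.
    scores:   list over models of H x W grids of unnormalized scores
              (hypothetical input; a learned module would produce these).
    """
    n_models = len(features)
    height, width = len(features[0]), len(features[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Weights sum to 1 across models at this location.
            weights = softmax([scores[m][y][x] for m in range(n_models)])
            out[y][x] = sum(
                weights[m] * features[m][y][x] for m in range(n_models)
            )
    return out


# Two models, a 1 x 2 feature map each.
feat_a = [[1.0, 1.0]]
feat_b = [[3.0, 3.0]]
# Equal scores at the first location, model A strongly preferred at the second.
score_a = [[0.0, 10.0]]
score_b = [[0.0, -10.0]]
blended = aggregate_features([feat_a, feat_b], [score_a, score_b])
```

With equal scores the output is the plain average of the two features; where one model's score dominates, the output follows that model's feature almost exactly.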
Findings
AFA outperforms static ensembling methods in quality and diversity.
The Spatial-Aware Block-Wise (SABW) feature aggregator effectively adapts model contributions across different states.
Experiments validate the efficiency and effectiveness of the proposed method.
Abstract
The success of the text-guided diffusion model has inspired the development and release of numerous powerful diffusion models within the open-source community. These models are typically fine-tuned on various expert datasets, showcasing diverse denoising capabilities. Leveraging multiple high-quality models to produce stronger generation ability is valuable, but has not been extensively studied. Existing methods primarily adopt parameter merging strategies to produce a new static model. However, they overlook the fact that the divergent denoising capabilities of the models may dynamically change across different states, such as when experiencing different prompts, initial noises, denoising steps, and spatial locations. In this paper, we propose a novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically adjusts the contributions of multiple models at the feature…
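For contrast, the static parameter-merging strategy that the abstract attributes to existing methods can be sketched as a fixed weighted average of parameters. This is a generic illustration with hypothetical names (`merge_parameters`, the toy parameter dictionaries), not the code of any specific merging method. The point is that the coefficients are chosen once, so the merged model behaves the same regardless of prompt, noise, step, or location.

```python
def merge_parameters(param_sets, coeffs):
    """Return a fixed weighted average of same-shaped parameter sets.

    param_sets: list of dicts mapping a parameter name to a flat list of floats.
    coeffs:     one mixing coefficient per model; must sum to 1.
    """
    if len(param_sets) != len(coeffs):
        raise ValueError("need one coefficient per model")
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    merged = {}
    for name, reference in param_sets[0].items():
        merged[name] = [
            sum(c * params[name][i] for c, params in zip(coeffs, param_sets))
            for i in range(len(reference))
        ]
    return merged


# Toy example: two "models" with a single two-element parameter each.
model_a = {"conv.weight": [1.0, 2.0]}
model_b = {"conv.weight": [3.0, 6.0]}
merged = merge_parameters([model_a, model_b], [0.5, 0.5])
```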
Peer Reviews
Decision: ICLR 2025 Poster
- Novel perspective on model ensembling that goes beyond both conventional parameter merging and timestep-specific expert selection strategies
- Comprehensive consideration of multiple states (prompts, noise levels, timesteps, spatial locations) for adaptive feature aggregation, offering more fine-grained control than existing expert-based approaches
- Well-designed empirical validation with thorough ablation studies and clear visualization of spatial attention patterns
- Practical advantages in
- While the paper introduces a novel method, it lacks direct comparison with recent mixture-of-experts diffusion approaches such as ERNIE-ViLG 2.0 [a], eDiff-I [b], and MEME [c], which also address the dynamic nature of the denoising process with multiple models. Basically, I think that the reader should be informed about the ideas that can be considered for improving DPMs using multiple models, such as using different expert models along the time axis, and how this study proposes
The paper presents a novel Adaptive Feature Aggregation (AFA) method for ensembling multiple diffusion models in text-guided image generation.
1. The paper introduces a new approach to ensembling diffusion models by dynamically adjusting the contributions of multiple models at the feature level based on various states such as prompts, initial noises, denoising steps, and spatial locations. This is a departure from existing methods that primarily adopt parameter merging strategies to produce a
1. AFA's computational efficiency is lower for individual inference steps compared to merging-based methods, primarily due to the additional parameters introduced by the Spatial-Aware Block-Wise (SABW) feature aggregator. However, the paper notes that AFA's tolerance for fewer inference steps can offset this initial inefficiency, making the overall computational cost comparable to that of base models or merging methods.
2. The performance of AFA is contingent on the quality of the base models.
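The trade-off in point 1 (a higher cost per step offset by fewer steps) comes down to simple arithmetic, sketched below. All numbers are hypothetical placeholders chosen for illustration, not measurements from the paper, and the function names are assumptions.

```python
import math


def total_cost(steps, per_step_cost):
    """Total sampling cost as step count times cost per step."""
    return steps * per_step_cost


def break_even_steps(base_steps, base_per_step_cost, ensemble_per_step_cost):
    """Largest step count at which the ensemble costs no more than the baseline."""
    return math.floor(base_steps * base_per_step_cost / ensemble_per_step_cost)


# Hypothetical setting: the baseline runs 50 steps at unit cost per step,
# and the ensemble is assumed to cost 2.5x as much per step.
max_steps = break_even_steps(50, 1.0, 2.5)
ensemble_cost = total_cost(max_steps, 2.5)
baseline_cost = total_cost(50, 1.0)
```

Under these assumed numbers, the ensemble matches the baseline's total cost only if it can reach comparable quality in 20 steps or fewer; whether it can is an empirical question the sketch does not answer.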
1. The method can adjust the contributions of multiple diffusion models based on various states.
2. The generated attention maps give a straightforward view of the context and timesteps in the ensembled diffusion model.
3. AFA demonstrates tolerance to reductions in inference steps.
1. The method integrates intermediate features from U-Net denoisers with the same architecture, raising doubts about its ability to ensemble a wider range of model types or architectures.
2. As the number of base models increases, the method's reliance on multiple base models may result in overfitting if, say, the base models have correlated features.
Taxonomy
Topics: Text and Document Classification Technologies · Machine Learning and Data Classification
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net · Diffusion
