Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Cusuh Ham; James Hays; Jingwan Lu; Krishna Kumar Singh; Zhifei Zhang; Tobias Hinz

arXiv:2302.12764·cs.CV·May 22, 2023

TL;DR

This paper introduces multimodal conditioning modules (MCM) that enable control over image synthesis in pretrained diffusion models without updating their parameters, allowing for efficient, flexible, and precise multimodal image generation.

Contribution

The paper proposes a novel, lightweight module that modulates pretrained diffusion models for multimodal conditioning without fine-tuning the entire network.

Findings

1. MCM enables spatial control over generated images.
2. Training MCM is computationally inexpensive and requires few examples.
3. MCM improves alignment between generated images and conditioning inputs.

Abstract

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion…

Tables

Table 5. MCM hyperparameters for all experiments.

Hyperparameter	Value
# parameters	3.9M
Channels	32
Channel multiplier	1, 1, 2, 4
# residual blocks	2
Attention resolutions	16
Batch size	64
Epochs	10k
Learning rate	1e-5

Equations

$$q(x_{t}\mid x_{0})=\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon, \tag{1}$$

$$x_{0}^{\prime}=\frac{x_{t}-\sqrt{1-\overline{\alpha}_{t}}\,\epsilon_{t}}{\sqrt{\overline{\alpha}_{t}}}, \tag{2}$$

$$x_{t-1}=\sqrt{\overline{\alpha}_{t-1}}\cdot x_{0}^{\prime}+\sqrt{1-\overline{\alpha}_{t-1}-\sigma_{t}^{2}}\cdot\epsilon_{t}+\sigma_{t}\epsilon, \tag{3}$$

$$L_{MSE}=\text{MSE}(x_{0}^{\prime},x_{0}), \tag{4}$$

$$L_{1}=L_{1}(\gamma)+L_{1}(\nu). \tag{5}$$

$$L_{MCM}=\lambda_{x}L_{MSE}+\lambda_{1}L_{1}, \tag{6}$$

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Diffusion · Balanced Selection

Full text

Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Cusuh Ham (Georgia Institute of Technology, USA), James Hays (Georgia Institute of Technology, USA), Jingwan Lu (Adobe Research, USA), Krishna Kumar Singh (Adobe Research, USA), Zhifei Zhang (Adobe Research, USA), and Tobias Hinz (Adobe Research, USA)

Abstract.

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network’s parameters. MCM is a small module trained to modulate the diffusion network’s predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only $\sim$ 1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs. Project page: https://mcm-diffusion.github.io

1. Introduction

Diffusion models have shown great potential in generating high-quality images that are realistic and diverse. However, current models rely heavily on large amounts of training data and are usually unconditional or only conditioned on more abstract conditions such as text (Saharia et al., 2022b; Rombach et al., 2022; Ramesh et al., 2022). The process of training these models is expensive and requires a large amount of computational resources. The reliance on vast amounts of training data limits the models’ applicability when less data is available, as is the case for many conditional generation tasks. While there exist some large datasets for text-conditional image synthesis (Schuhmann et al., 2022), datasets for more controlled image synthesis, such as conditioning on segmentation maps, are orders of magnitude smaller (Kuznetsova et al., 2020; Benenson et al., 2019; Lin et al., 2014).

Many approaches try to address these limitations by fine-tuning a pretrained model for a specific domain (Ruiz et al., 2022; Kawar et al., 2022) or to accept additional conditioning modalities such as segmentation maps or sketches (Xie et al., 2022). However, this requires access to the model parameters and significant computational resources as gradients have to be calculated for the full model. Furthermore, fine-tuning a full model limits the applicability since the models are large and it can be difficult to easily share them. Thus, this approach does not scale since a new full-sized diffusion model is required for each new domain or combination of modalities. Another challenge with fine-tuning models is that they quickly overfit to the smaller subset of data that they are fine-tuned on.

Training models conditioned on the chosen modality from scratch (Zhang et al., 2021; Wu et al., 2022; Gafni et al., 2022; Huang et al., 2022) is limited by the available training data, reduces diversity, and diminishes the applicability of the trained model. Additionally, if the model needs to be conditioned on another modality, it needs to be retrained. A pretrained model can also be guided towards a desired direction at test time, e.g., by using gradients from a pretrained classifier or CLIP network (Liu et al., 2023). However, this approach slows down sampling, as the gradients must be calculated on the fly and optimized per sample.

Our approach addresses these limitations by introducing a novel method for multimodal conditional image synthesis using pretrained diffusion networks without changing any parameters of or requiring any gradients from the diffusion network itself. This means that the diffusion network can be treated as a black box and can even be accessed remotely, as the only data our approach needs are the predictions of the diffusion net for each sampling timestep. To achieve this, we train a small diffusion-like network conditioned on new modalities to modulate the original diffusion network’s predictions at each sampling timestep so that the generated image follows the provided conditioning. The modulating network is the only model that is trained while the original diffusion network stays frozen, ensuring that the original diffusion network’s high quality and diversity are preserved while also allowing for specific and tailored conditional image generation.

Our approach is computationally efficient as it requires fewer computational resources than training a diffusion net from scratch or fine-tuning an existing diffusion net. This is due to the small size of the modulating network and the lack of need to calculate gradients for the large diffusion net. Our approach generalizes well even when using only a small amount of training data. Other approaches such as fine-tuning, on the other hand, need much more training data or quickly suffer from overfitting. At test time, our approach does not slow down the sampling process since no gradients need to be calculated and the only computational overhead comes from running the small diffusion net, which is negligible compared to running the large diffusion net.

Figure 1 shows results of combining our multimodal conditioning module (MCM) with Stable Diffusion (SD) (Rombach et al., 2022), which is originally only conditioned on text. Incorporating our MCM adds more control to the image generation by being able to condition on additional modalities such as a segmentation map or a sketch on top of the existing text condition. A single trained MCM is able to handle different input conditions, e.g., in this case only a segmentation map, only a sketch, or both. While the text specifies a rough layout of the image, the additional modalities allow for much more fine-grained control over the generation process. In the case of SD, the network predicts the noise for each sampling timestep conditioned on the text, and then MCM modulates the noise prediction based on the new conditions.

Our main contribution is the introduction of multimodal conditioning modules (MCM), a method for adapting pretrained diffusion models for conditional image synthesis without changing the original model’s parameters. MCM is a small network trained on limited paired examples of the target modalities to modulate the output of the diffusion model during sampling. Our MCMs are roughly 100 times smaller than the original diffusion models and even when training on only a few thousand labeled examples our method obtains high-quality and diverse results while being cheaper and using less memory than training from scratch or fine-tuning a large model.

2. Related Work

Conditional image synthesis. Previous studies on conditional image synthesis have explored GANs to bridge two statistically distinct domains, such as mapping sketches or segmentation maps into photo-realistic images. One notable example is the StyleGAN series (Karras et al., 2019, 2020b, 2020a, 2021; Sauer et al., 2022), which has served as a source of inspiration for many other conditional generation works. Another is the pix2pix series (Isola et al., 2017), including works such as (Park et al., 2019; Richardson et al., 2021; Sushko et al., 2022). The introduction of transformers (Esser et al., 2021) has further enhanced the visual quality of generated images.

Recently, diffusion models have emerged as an alternative to GANs and transformers, showing increased image quality and alignment with textual conditions (Nichol and Dhariwal, 2021; Balaji et al., 2022; Feng et al., 2022). These models have made significant advancements in text-to-image generation, with DALL-E2 (Ramesh et al., 2022) proposing a framework that uses CLIP latents and builds on previous works such as GLIDE (Nichol et al., 2021) and Guided-diffusion (Dhariwal and Nichol, 2021). Latent diffusion models (LDM) (Rombach et al., 2022) operate in the latent space of an image autoencoder, showing strong adaptability and superior quality for tasks such as segmentation-conditioned image synthesis, image super-resolution, and image inpainting. Imagen (Saharia et al., 2022b) uses a pyramid approach to generate high-quality images in the pixel space, marking a breakthrough in pixel-based diffusion models. There are also other pioneering works: SDEdit (Meng et al., 2022) proposes a stochastic differential equation during the sampling process for image editing, Diff-AE (Preechakul et al., 2022) conducts attribute interpolation using a diffusion model, and SR3 (Saharia et al., 2022c), Palette (Saharia et al., 2022a), PITI (Wang et al., 2022), and Plug-and-Play (Tumanyan et al., 2022) propose various methods for image-to-image translation using diffusion models. Additionally, ControlNet (Zhang and Agrawala, 2023), T2I (Mou et al., 2023), and Latent Edge Predictor (Voynov et al., 2022) are concurrent works that add new conditioning modalities to pretrained diffusion models.

Multimodal conditional image synthesis. Multimodal conditional image synthesis is a technique that uses multiple conditions from various modalities, such as masks, sketches, and language, to generate images. PoE-GAN (Huang et al., 2022) uses product-of-experts GANs to synthesize images based on any subset of multiple modalities, including an empty set. Make-A-Scene (Gafni et al., 2022) utilizes the transformer to tokenize domain-specific knowledge and adapts classifier-free guidance for the transformer use case. It accepts text and scene layouts for image synthesis. M6-UFC (Zhang et al., 2021) also leverages the transformer and can unify any number of multi-modal controls, where both the control signals and the synthesized image are represented as a sequence of discrete tokens.

In diffusion-based methods, eDiff-I (Balaji et al., 2022) utilizes multiple encoders, i.e., both T5 and CLIP encoders, in the diffusion model to handle text, image, and layout conditions. SpaText (Avrahami et al., 2022) introduces spatio-textual representations to condition on text and semantic layouts. SDG (Liu et al., 2023) proposes a unified framework for semantic diffusion guidance, which allows for either language or image guidance, or both. Additionally, Composer (Huang et al., 2023) is concurrent work that conditions the diffusion net on global (e.g., text) and local (e.g., edgemaps) modalities. The main difference between these approaches and ours is that all of them are trained from scratch on different conditioning modalities. In contrast, MCM adds new conditioning modalities to an existing model without having to retrain or fine-tune the underlying generative model itself.

3. Approach

We propose the multimodal conditioning module (MCM), which aims to inject user control into pretrained diffusion models using a set of modalities originally unseen during training. MCM is a small module that is trained using limited paired examples to modulate the diffusion denoising process. We highlight several advantages of our approach: MCM 1) does not update the parameters of the diffusion model, 2) can easily be expanded to incorporate additional modalities through concatenation, 3) does not require individual modality encoders, and 4) can be applied to unconditional and conditional diffusion models. In this section, we establish notation with a brief overview of diffusion models and describe our proposed method.

3.1. Diffusion models

A diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020) is trained with a defined variance schedule $\{\beta_{t}\}_{t=1}^{T}$ across $T$ timesteps. The forward noising process for an input $x_{0}\in\mathbb{R}^{H\times W\times 3}$ is a fixed computation defined as:

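$$q(x_{t}\mid x_{0})=\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon, \tag{1}$$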

where $\epsilon\sim N(0,I)$ , $\alpha_{t}=1-\beta_{t}$ , and $\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ .
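The forward noising step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the linear $\beta$ schedule and its endpoints are assumptions (the text only says the schedule is "defined"), and the helper names `make_alpha_bar` and `q_sample` are hypothetical.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of alpha_t = 1 - beta_t.

    The linear beta schedule and its endpoints are assumptions for
    illustration; the paper does not specify them in this section.
    """
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Noise x0 to timestep t (1-indexed, as in the text)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # toy "image" in [-1, 1]
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because $\overline{\alpha}_{t}$ decreases with $t$, larger timesteps keep less of $x_{0}$ and more of the noise.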

The reverse denoising process $p_{\theta}(x_{t},t)$ is trained to predict the noise $\epsilon_{t}$ added in the forward process at timestep $t$ . For diffusion models already conditioned on a given modality $y^{\ast}$ , we use its respective encoder $\tau(\cdot)$ to encode $y^{\ast}$ and feed it as additional input to the reverse process, i.e., $p_{\theta}\big{(}x_{t},t,\tau(y^{\ast})\big{)}$ . The reverse process is trained by optimizing the mean-squared error (MSE) between the predicted noise $\epsilon_{t}$ and $\epsilon\sim N(0,I)$ .

Given a noisy sample $x_{t}$ and the predicted $\epsilon_{t}$ , the fully denoised sample $x_{0}^{\prime}$ can then be approximated by:

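$$x_{0}^{\prime}=\frac{x_{t}-\sqrt{1-\overline{\alpha}_{t}}\,\epsilon_{t}}{\sqrt{\overline{\alpha}_{t}}}, \tag{2}$$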

and the next denoised timestep $x_{t-1}$ can be computed using various sampling methods, such as the DDIM (Song et al., 2020) formulation:

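$$x_{t-1}=\sqrt{\overline{\alpha}_{t-1}}\cdot x_{0}^{\prime}+\sqrt{1-\overline{\alpha}_{t-1}-\sigma_{t}^{2}}\cdot\epsilon_{t}+\sigma_{t}\epsilon, \tag{3}$$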

where $\sigma_{t}=\eta\sqrt{(1-\overline{\alpha}_{t-1})/(1-\overline{\alpha}_{t})}\sqrt{1-\overline{\alpha}_{t}/\overline{\alpha}_{t-1}}$ , $\eta\in\mathbb{R}_{\geq 0}$ is a hyperparameter, and $\epsilon\sim N(0,I)$ . When $\eta=0.0$ , Equation 3 becomes a deterministic process, which can lead to more efficient sampling using fewer timesteps.
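The denoised estimate and the DDIM update described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: `a_t` and `a_prev` stand for $\overline{\alpha}_{t}$ and $\overline{\alpha}_{t-1}$, the function names are hypothetical, and the noise prediction is passed in rather than produced by a real diffusion network.

```python
import numpy as np

def predict_x0(x_t, eps_t, a_t):
    """Approximate the fully denoised sample from x_t and the predicted noise."""
    return (x_t - np.sqrt(1.0 - a_t) * eps_t) / np.sqrt(a_t)

def ddim_sigma(a_t, a_prev, eta):
    """Noise scale sigma_t as defined in the text; eta = 0 gives sigma_t = 0."""
    return eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)

def ddim_step(x_t, eps_t, a_t, a_prev, eta=0.0, rng=None):
    """One DDIM update from x_t to x_{t-1}."""
    x0_pred = predict_x0(x_t, eps_t, a_t)
    sigma = ddim_sigma(a_t, a_prev, eta)
    if sigma > 0:
        if rng is None:
            rng = np.random.default_rng()
        noise = rng.standard_normal(x_t.shape)
    else:
        noise = 0.0  # deterministic case, no fresh noise is drawn
    # Clamp guards against tiny negative values from floating-point rounding.
    direction = np.sqrt(np.maximum(1.0 - a_prev - sigma**2, 0.0)) * eps_t
    return np.sqrt(a_prev) * x0_pred + direction + sigma * noise

# Toy check with a "perfect" noise prediction (illustrative values only).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4, 3))
eps = rng.standard_normal((4, 4, 3))
a_t, a_prev = 0.5, 0.8
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev, eta=0.0)
```

With $\eta=0$ and an exact noise prediction, the step lands on the forward-process sample at the earlier noise level, which is why the deterministic sampler can skip timesteps.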

Diffusion models can learn to either denoise $x_{t}$ into an RGB image directly or can work in the latent space of an autoencoder (Rombach et al., 2022). In this work, we apply our approach to LDMs as they allow for easier high-resolution image outputs and have better publicly available models. In the case of LDMs, a pretrained autoencoder $A=\{E,D\}$ , consisting of an encoder $E$ and decoder $D$ , is used to first encode $x$ into its latent representation $z=E(x)\in\mathbb{R}^{h\times w\times d}$ . The $x_{i}$ ’s can be directly replaced with the respective $z_{i}$ in the above equations, and the predicted denoised latent calculated in Equation 2 can be decoded into an image $x_{0}^{\prime}=D(z_{0}^{\prime})$ .

3.2. Modulating pretrained diffusion models

Given a paired dataset $\{(x,y_{1},...,y_{n})\}$ of images $x$ and $n$ target modalities $\{y_{i}\}_{i=1}^{n}$ , we train MCM, a small network that enables a pretrained diffusion model to condition its outputs on the $y_{i}$ ’s. We inject the guidance into the denoising process by using MCM to modulate the predicted noise map $\epsilon_{t}$ of the diffusion model at each timestep. By modulating an intermediate variable that is used to compute the next timestep $x_{t-1}$ rather than $x_{t-1}$ directly, we are not limited to using a specific sampling technique at inference.

We visualize the MCM modulation pipeline in Figure 2. Similar to a standard diffusion model training step, we take an input image $x_{0}$ , sample a random timestep $t\sim\text{Uniform}(1,T)$ , compute the noised image $x_{t}=q(x_{t}|x_{0})$ , and get the predicted noise map $\epsilon_{t}=p_{\theta}(x_{t},t)$ . Given the modalities $(y_{1},...,y_{n})$ corresponding to $x$ , we concatenate $\{x_{t},\epsilon_{t},y_{1},...,y_{n}\}$ as input with the timestep $t$ to MCM, which outputs a set of parameters, $\{\gamma_{t},\nu_{t}\}=\text{MCM}\big{(}\{x_{t},\epsilon_{t},y_{1},...,y_{n}\},t\big{)}$ . We use $\gamma_{t}$ and $\nu_{t}$ to modulate the predicted noise as $\epsilon_{t}^{\prime}=\epsilon_{t}\otimes(1+\gamma_{t})\oplus\nu_{t}$ . The use of spatial modulation parameters is inspired by SPADE (Park et al., 2019), which was originally proposed for predicting modulation parameters for normalization layers to better retain semantic information for conditional image synthesis.
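The data flow of the modulation step can be sketched as below. This is only an illustration: `placeholder_mcm` is a hypothetical stand-in that returns zeros in place of the trained U-Net, and the latent shape `(32, 32, 4)` is an assumed example, with one channel each for the segmentation map and the sketch.

```python
import numpy as np

def build_mcm_input(x_t, eps_t, conds):
    """Concatenate {x_t, eps_t, y_1, ..., y_n} along the channel axis."""
    return np.concatenate([x_t, eps_t, *conds], axis=-1)

def placeholder_mcm(mcm_input, t, out_channels):
    """Hypothetical stand-in for the trained MCM network.

    Returns all-zero (gamma_t, nu_t); a real MCM would be a time-conditional
    U-Net that actually uses mcm_input and t.
    """
    shape = mcm_input.shape[:-1] + (out_channels,)
    return np.zeros(shape), np.zeros(shape)

def modulate(eps_t, gamma_t, nu_t):
    """eps'_t = eps_t * (1 + gamma_t) + nu_t, applied elementwise."""
    return eps_t * (1.0 + gamma_t) + nu_t

rng = np.random.default_rng(0)
h, w, c = 32, 32, 4                      # assumed latent shape, for illustration
x_t = rng.standard_normal((h, w, c))
eps_t = rng.standard_normal((h, w, c))   # would come from the frozen diffusion model
seg = rng.uniform(-1.0, 1.0, (h, w, 1))
sketch = rng.uniform(-1.0, 1.0, (h, w, 1))

mcm_input = build_mcm_input(x_t, eps_t, [seg, sketch])
gamma_t, nu_t = placeholder_mcm(mcm_input, t=500, out_channels=c)
eps_mod = modulate(eps_t, gamma_t, nu_t)
```

When both parameters are zero the modulation is the identity, so the frozen model's prediction passes through unchanged; the $(1+\gamma_{t})$ form makes "no change" the natural starting point.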

We substitute $\epsilon_{t}$ with $\epsilon_{t}^{\prime}$ in Equation 2 to compute the predicted modulated denoised image $x_{0}^{\prime}$ , for which we want to adhere to the constraints specified by the $y_{i}$ ’s. The loss is defined as:

$$L_{MSE}=\text{MSE}(x_{0}^{\prime},x_{0}), \tag{4}$$

where $\text{MSE}(x_{0}^{\prime},x_{0})$ is the mean-squared error between the modulated denoised image $x_{0}^{\prime}$ and the ground truth image $x_{0}$ . For LDMs, we avoid calculating and storing gradients through the decoder $D$ by applying Equation 4 between the predicted modulated denoised latent representation $z_{0}^{\prime}$ and ground truth latent $z_{0}$ .

We also apply $L_{1}$ -regularization over the modulation parameters $\gamma$ and $\nu$ to encourage MCM to learn minimal perturbations to $\epsilon_{t}$ :

$$L_{1}=L_{1}(\gamma)+L_{1}(\nu). \tag{5}$$

Thus, the final training objective is defined as:

$$L_{MCM}=\lambda_{x}L_{MSE}+\lambda_{1}L_{1}, \tag{6}$$

where $\lambda_{x}$ and $\lambda_{1}$ are scalar weighting terms.
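Putting the pieces together, the training objective might be computed as in the following sketch. Several details are assumptions rather than statements from the paper: the MSE is taken as a mean over elements, $L_{1}(\cdot)$ as a sum of absolute values, and $\lambda_{1}$ defaults to one over the number of elements, following the value given in Appendix A. The function name `mcm_loss` is hypothetical.

```python
import numpy as np

def mcm_loss(x_t, eps_t, gamma, nu, x0, a_t, lambda_x=1.0, lambda_1=None):
    """L_MCM = lambda_x * L_MSE + lambda_1 * L_1 for one batch.

    a_t is the cumulative alpha product at the sampled timestep.
    """
    eps_mod = eps_t * (1.0 + gamma) + nu                       # modulated noise
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_mod) / np.sqrt(a_t)
    l_mse = np.mean((x0_pred - x0) ** 2)                       # assumed: mean reduction
    l_1 = np.abs(gamma).sum() + np.abs(nu).sum()               # assumed: sum reduction
    if lambda_1 is None:
        lambda_1 = 1.0 / gamma.size                            # 1 / (b * h * w * c)
    return lambda_x * l_mse + lambda_1 * l_1

rng = np.random.default_rng(0)
shape = (2, 8, 8, 4)                       # (b, h, w, c), toy sizes
x0 = rng.standard_normal(shape)
eps = rng.standard_normal(shape)
a_t = 0.5
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps

zeros = np.zeros(shape)
loss_identity = mcm_loss(x_t, eps, zeros, zeros, x0, a_t)
loss_perturbed = mcm_loss(x_t, eps, 0.5 * np.ones(shape), zeros, x0, a_t)
```

In this toy setup the noise prediction is already exact, so any nonzero modulation is penalized twice: once through the reconstruction error and once through the regularizer.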

We apply the modality dropout technique (Huang et al., 2022), where, with probabilities $\{p_{i}\}_{i=1}^{n}$ , we replace the respective modality $y_{i}$ with -1’s during training. At test time, MCM is able to predict modulation parameters even in the absence of one or more modalities, avoiding heavy reliance on a single modality.
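Modality dropout as described above can be sketched like this. The helper `drop_modalities` is hypothetical, and dropping each modality independently once per call (rather than per sample within a batch) is an assumption made for brevity.

```python
import numpy as np

def drop_modalities(conds, probs, rng):
    """Replace each modality y_i by an array of -1's with probability p_i."""
    dropped = []
    for y, p in zip(conds, probs):
        if rng.random() < p:
            dropped.append(np.full_like(y, -1.0))  # "absent" marker from the text
        else:
            dropped.append(y)
    return dropped

rng = np.random.default_rng(0)
seg = rng.uniform(0.0, 1.0, (16, 16, 1))
sketch = rng.uniform(0.0, 1.0, (16, 16, 1))

# Dropout rates of 0.33 are the values reported in Appendix A.
seg_in, sketch_in = drop_modalities([seg, sketch], probs=[0.33, 0.33], rng=rng)
```

At test time the same convention applies: a modality the user does not provide is passed in as an array of -1's.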

4. Experiments

In this section, we describe our experimental setup and evaluation protocols, and present qualitative and quantitative results for MCM. We primarily focus on the addition of sketches and semantic segmentation maps to latent diffusion models (LDMs) (Rombach et al., 2022) due to availability of data and public model checkpoints.

Network architectures and training details. We leverage pretrained unconditional and text-conditioned LDMs as our base models: two unconditional LDMs trained on CelebA (Liu et al., 2015) and Mountains (Park et al., 2020) and text-conditioned Stable Diffusion v2.1 (SD) (Rombach et al., 2022) trained on a subset of LAION-5B (Schuhmann et al., 2022). The two unconditional LDMs produce $256\times 256$ resolution images, and the text-conditioned SD produces $512\times 512$ resolution images. We use the publicly available non-EMA weights for the CelebA and SD models, while we trained the Mountains LDM from scratch (no public checkpoints for an unconditional model were available). Experiments with MCM applied to a pixel-based diffusion model can be found in Appendix E.

We use a time-conditional U-Net (Ronneberger et al., 2015) to output the modulation parameters, $\gamma$ and $\nu$ . We train one MCM per dataset/model combination, with its number of parameters totalling $\sim$ 1% of the unconditional LDMs and $\sim$ 0.4% of SD. We use $\lambda_{x}=1$ for unconditional LDMs and $\lambda_{x}=10$ for SD. Specific architecture and training details can be found in Appendix A.

Datasets. We evaluate the performance of MCM on two datasets: MM-CelebA-HQ (Xia et al., 2021; Liu et al., 2015; Karras et al., 2018; Lee et al., 2020) and Flickr Mountains (Park et al., 2020). MM-CelebA-HQ contains segmentation maps, sketches, and captions for 30,000 images of celebrity faces, of which $\sim$ 6,000 are designated test images. Flickr Mountains contains 500,000 mountain images scraped from Flickr with $\sim$ 6,000 test images. Because it does not contain any other corresponding modalities, we use the same pipeline used by PoE-GAN (Huang et al., 2022) to produce pseudo-ground-truth segmentation maps and sketches: we use DeepLab-v2 (Chen et al., 2017) to generate segmentation maps, and HED (Xie and Tu, 2015) with sketch simplification (Simo-Serra et al., 2016) to generate sketches. We also use BLIP (Li et al., 2022) to generate captions for Mountains for SD experiments. Collecting paired multimodal data at the scale required to train state-of-the-art conditional generative models can be difficult and expensive. Thus, by default, we only use a randomly sampled subset of 5,000 training examples for our experiments to highlight the efficacy and application of our approach under constrained settings. We provide comparisons to MCM trained with the full CelebA dataset to quantify the effect of the amount of training data. We use the full test sets for evaluations, and visualize results generated using held-out test examples of the modalities as inputs.

Evaluation metrics. We use Fréchet Inception Distance (FID) (Heusel et al., 2017) and Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) to evaluate image quality and diversity. For each set of input modalities, we sample two images and compute the LPIPS between the two, averaged across the test set.

FID and LPIPS measure image quality and diversity; we emphasize that neither metric quantifies the alignment of the generated image to its respective conditioning inputs. However, other works on multimodal conditional synthesis only report values using these quality metrics. We propose to use metrics from related work on conditional image editing (Liu et al., 2022b) to quantify the alignment of the generated image to the conditioning inputs: 1) mean intersection over union (mIoU), 2) segmentation accuracy, and 3) sketch distance (Ham et al., 2022). For the two segmentation alignment metrics (mIoU and accuracy), we leverage a pretrained BiSeNet (Yu et al., 2018) for CelebA, and the same DeepLab-v2 (Chen et al., 2017) network used to generate pseudo-ground-truth segmentation maps for Mountains.
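For reference, the two segmentation alignment metrics can be computed from a pair of label maps as in the sketch below. Here `pred` stands for the segmentation of a generated image (obtained from a pretrained segmentation network in the paper's setup) and `target` for the conditioning map. Skipping classes that appear in neither map when averaging is an assumption; the text does not specify the averaging convention.

```python
import numpy as np

def seg_alignment(pred, target, num_classes):
    """Return (mIoU, pixel accuracy) for two integer label maps of equal shape."""
    ious = []
    for k in range(num_classes):
        p, g = pred == k, target == k
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumed convention)
        ious.append(np.logical_and(p, g).sum() / union)
    miou = float(np.mean(ious)) if ious else 0.0
    acc = float((pred == target).mean())
    return miou, acc

target = np.array([[0, 0],
                   [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
miou, acc = seg_alignment(pred, target, num_classes=2)
# Class 0: IoU = 1/2; class 1: IoU = 2/3; mIoU = 7/12; accuracy = 3/4.
```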

Baselines. Since MCM does not modify the pretrained diffusion model’s weights, other recent GAN- or diffusion-based approaches to multimodal conditional synthesis that are trained from scratch, such as PoE-GAN (Huang et al., 2022) and Make-A-Scene (Gafni et al., 2022), are not directly applicable as baselines. Additionally, many of these approaches do not use the same modalities explored in this work, and do not have publicly released code.

Most similar to our experimental setup is SDG (Liu et al., 2023), which leverages gradients from “guidance” networks for each modality to optimize each sample at test time, thus requiring a forward pass through each network at every sampling step, which slows down sampling. While SDG does not update the parameters of the diffusion model, the guidance networks require fine-tuning on noisy data in order to produce meaningful gradients for the initial timesteps during sampling. Additionally, SDG was proposed for pixel-based diffusion models, but can be adapted to LDMs by performing a forward pass through the decoder at each sampling step. We omit comparisons to SDG due to memory constraints presented by the additional step through the decoder, reliance on the guidance networks, and slow sampling speeds.

Instead, we compare against fine-tuning the diffusion model directly. We expand the input channels of the first convolutional layer of the pretrained LDMs to accommodate the additional modalities, and train using the same settings as MCM. Since we want to enable all combinations of inputs, we adjust the modality dropout rates for the fine-tuning models to $p_{seg}=p_{sketch}=p_{\{seg,sketch\}}=0.25$ . We report all metrics on unconditional samples for reference, where the unconditional outputs for MCM are directly from the original diffusion model. We use DDIM sampling (Song et al., 2020) with $N=200$ steps and $\eta=0.0$ for all methods and evaluations, and an unconditional guidance scale of 5.0 for SD. We also include evaluations against publicly available checkpoints for segmentation- and sketch-conditioned pSp (Richardson et al., 2021), a StyleGAN (Karras et al., 2019) encoder-based method, and against a multimodal variant of concurrent work, ControlNet (Zhang and Agrawala, 2023).
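One common way to expand a convolution's input channels for such a fine-tuning baseline is sketched below. The text does not say how the new channels are initialized; zero-initializing them, so that the expanded layer initially ignores the new modalities, is an assumption here, as is the `(out, in, kh, kw)` weight layout and the helper name.

```python
import numpy as np

def expand_input_channels(weight, extra_in):
    """Append `extra_in` zero-initialized input channels to a conv weight.

    weight has layout (out_channels, in_channels, kh, kw) (assumed layout).
    """
    out_c, _, kh, kw = weight.shape
    new = np.zeros((out_c, extra_in, kh, kw), dtype=weight.dtype)
    return np.concatenate([weight, new], axis=1)

rng = np.random.default_rng(0)
w_old = rng.standard_normal((32, 4, 3, 3)).astype(np.float32)  # toy first-layer weight
w_new = expand_input_channels(w_old, extra_in=2)               # + segmentation, + sketch
```

Unlike MCM, this baseline still has to backpropagate through and update the full diffusion model.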

Results. We visualize the magnitude and effects of the modulation parameters on the predicted denoised images $x_{0}^{\prime}$ during sampling using the same input noise map $z_{T}$ in Figure 3. The unconditional predictions $x_{0}$ are ambiguous at larger values of $t$ , whereas MCM outputs parameters that enforce more structure into $x_{0}^{\prime}$ early on so that the final image will adhere to the inputs. The magnitude of the modulation parameters is greater at larger values of $t$ , peaking towards $t=\frac{N}{2}$ , and then quickly decreasing since the remaining steps are mostly responsible for adding high frequency details to the image (Choi et al., 2022).

We show that MCM provides better control overall than the fine-tuning baseline in terms of balancing control with quality and diversity using only a small number of training examples. In Table 1, we observe that MCM has a relatively small drop in quality and diversity from the base LDM compared to fine-tuning, and even improves the quality of the generated images for Mountains. While we expect a drop in diversity to accommodate the constraints defined by the conditioning signals, we show that MCM is able to generate consistent yet distinct images (see Figures 4 and 5). Meanwhile, fine-tuning is susceptible to overfitting to a small training set, producing blurrier and less diverse images for all input combinations. We visualize more MCM examples in Figure 7. ControlNet performs comparably to MCM but trains a much larger model ( $\sim$ 50% of the size of the original diffusion model) and needs access to the original model’s parameters.

Table 2 shows the alignment metrics for our method compared to all baselines. Compared to the base LDM, we observe increased alignment between the inputs and generated images from MCM, fine-tuning, and ControlNet. Fine-tuning tends to produce blurrier images where the distinctions between classes are unclear, which may account for the worse sketch alignment. Additionally, the identities of the faces generated by the fine-tuned CelebA model using the same inputs tend to be almost identical, relying mainly instead on illumination changes to produce “diversity” among the images. Thus, the greater alignment with fine-tuning comes at the expense of diversity (see Figures 8 and 9). ControlNet achieves similar alignment scores and quality as MCM.

We observe more difficulty with segmentation alignment on Mountains for MCM and fine-tuning. Unlike CelebA, where ground truth annotations are provided, the segmentation maps for Mountains are generated using an off-the-shelf network and span a larger number of classes (182 compared to CelebA’s 19). Thus, both methods suffer from using poorer quality annotations as ground truth, but we believe that both would improve with better data.

Ablation study. We perform an ablation study using CelebA to evaluate the effect of the $L_{1}$ regularization term (Equation 5) and using limited training data. Results are shown in Tables 3 and 4. We show that only using the MSE term for the training objective demonstrates similar behavior to the fine-tuning baseline: alignment improves while the overall quality and diversity of the images suffer. Thus, $L_{1}$ regularization of the modulation parameters helps balance the quality/consistency trade-off.

We compare MCM trained with a random subset of 5,000 examples against another trained with the full CelebA dataset. We train both modules with the same number of iterations so they observe the same number of training examples. Additional training data shows a similar pattern in the evaluation metrics to the addition of the $L_{1}$ term. The alignment metrics for MCM trained with the full dataset are likely to benefit from additional training time since there are more variations and more examples of the less common classes to learn. We also compare how ablating the amount of training data affects the quality of ControlNet and MCM in Appendix D.

5. Conclusion

We introduce MCM, a novel method for multimodal image synthesis with diffusion models. Previous approaches to conditional image synthesis primarily rely on training from scratch or fine-tuning using large amounts of data and computational resources, which can be difficult or even infeasible. We avoid this by taking a pretrained diffusion model, freezing its weights, and only training a small module using a limited number of paired examples of new target modalities to modulate the sampling process. We evaluate our method using standard quality assessment metrics as well as alignment metrics to show that we are able to effectively incorporate user control while retaining high image quality.

Limitations. While our approach is able to efficiently apply multimodal control to pretrained diffusion models, MCM is currently limited to 2D modalities. We leave the incorporation of 1D modalities for future work, but show that MCM can be applied to text-conditioned models such as Stable Diffusion. Our approach can be more sensitive to the starting noise map $z_{T}$ , and struggles with grounding semantics into class labels when the training data quality is poor. Additionally, MCM is limited to more structured domains.

Appendix A Network architecture and training details

We use a time-conditional U-Net (Ronneberger et al., 2015) for MCM and substitute the last convolutional layer with a split head, where one head outputs the multiplicative modulation parameter $\gamma$ and the other outputs the additive parameter $\nu$ . The total number of parameters of MCM is $\sim$ 1% of the unconditional LDMs and $\sim$ 0.4% of SD. We use $\lambda_{x}=1$ for unconditional LDMs and $\lambda_{x}=10$ for SD. For all experiments, we use modality dropout rates $p_{seg}=p_{sketch}=0.33$ , and weighting term $\lambda_{1}=\frac{1}{b\times h\times w\times c}$ in Equation 6, where $b$ is the batch size, and $(h,w,c)$ are the dimensions of the latent representations $z$ . For SD, we randomly sample the latent vector from the KL-regularized autoencoder and modulate the output of the diffusion model without classifier-free guidance.
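The weighting $\lambda_{1}=\frac{1}{b\times h\times w\times c}$ can be read as turning a summed $L_{1}$ penalty into a per-element mean. The following is a minimal NumPy sketch of that reading, assuming the penalty sums the absolute values of both $\gamma$ and $\nu$; the exact form of the regularizer is given by the paper's equation and is not reproduced here, and `l1_regularizer` is a hypothetical helper name.

```python
import numpy as np

def l1_regularizer(gamma, nu):
    """Summed L1 penalty on the modulation parameters, scaled by lambda_1.

    Assumption: the penalty is sum(|gamma|) + sum(|nu|); lambda_1 = 1 / (b*h*w*c).
    """
    b, h, w, c = gamma.shape
    lambda_1 = 1.0 / (b * h * w * c)
    return lambda_1 * (np.abs(gamma).sum() + np.abs(nu).sum())

# Toy tensors with layout (batch, height, width, channels).
rng = np.random.default_rng(0)
gamma = rng.normal(size=(4, 8, 8, 3))
nu = rng.normal(size=(4, 8, 8, 3))

penalty = l1_regularizer(gamma, nu)

# With this weighting the penalty equals the mean absolute value of each parameter map.
mean_form = np.abs(gamma).mean() + np.abs(nu).mean()
```

Under this assumption the strength of the penalty does not depend on the batch size or the latent resolution, which may be why the same weighting can be shared across experiments.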

MCM takes in concatenated inputs $\{x_{t},\epsilon_{t},y_{1},...,y_{n}\}$ and timestep $t$, where $x_{t},\epsilon_{t}\in\mathbb{R}^{h\times w\times c}$ and $y_{i}\in\mathbb{R}^{h\times w\times c_{y_{i}}}$ (for sketches and segmentation maps, $c_{y_{1}}=c_{y_{2}}=1$). The last convolutional layer is replaced with a zero-initialized split head (one head outputs $\gamma$ and the other outputs $\nu$, where $\gamma,\nu\in\mathbb{R}^{h\times w\times c}$). For all experiments, MCM is trained with the hyperparameters described in Table 5 on NVIDIA A100s. We train three MCMs, one per diffusion model/dataset combination (unconditional CelebA LDM, unconditional Mountains LDM, text-conditioned Mountains SD).
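The input layout and the zero-initialized split head can be illustrated with a small NumPy sketch. Several parts are assumptions rather than the authors' implementation: a dropped modality is replaced by an all-zero map, the U-Net body is skipped and the raw input stands in for its features, the head is a per-pixel linear map (a $1\times 1$ convolution), and the modulation is taken to have the form $(1+\gamma)\odot\epsilon_{t}+\nu$. All function names are hypothetical.

```python
import numpy as np

def drop_modalities(conditions, drop_probs, rng):
    """Zero out each conditioning map independently with its dropout probability.

    Assumption: a dropped modality is replaced by an all-zero map.
    """
    return [np.zeros_like(y) if rng.random() < p else y
            for y, p in zip(conditions, drop_probs)]

def build_mcm_input(x_t, eps_t, conditions):
    """Concatenate x_t, eps_t and all conditioning maps along the channel axis."""
    return np.concatenate([x_t, eps_t, *conditions], axis=-1)

def split_head(features, weight, bias):
    """Per-pixel linear map with 2*c outputs, split into gamma and nu."""
    out = features @ weight + bias
    gamma, nu = np.split(out, 2, axis=-1)
    return gamma, nu

rng = np.random.default_rng(0)
h, w, c = 8, 8, 3
x_t = rng.normal(size=(h, w, c))
eps_t = rng.normal(size=(h, w, c))
seg = rng.integers(0, 2, size=(h, w, 1)).astype(np.float64)     # c_y1 = 1
sketch = rng.integers(0, 2, size=(h, w, 1)).astype(np.float64)  # c_y2 = 1

conditions = drop_modalities([seg, sketch], drop_probs=[0.33, 0.33], rng=rng)
mcm_input = build_mcm_input(x_t, eps_t, conditions)  # shape (h, w, 2*c + 2)

# Stand-in for the U-Net body: the raw input is used directly as features.
weight = np.zeros((mcm_input.shape[-1], 2 * c))  # zero-initialized head
bias = np.zeros(2 * c)
gamma, nu = split_head(mcm_input, weight, bias)

# Assumed modulation form; at initialization it leaves eps_t unchanged.
eps_modulated = (1.0 + gamma) * eps_t + nu
```

If the modulation does take this form, zero initialization means training starts from the unmodified behavior of the pretrained model, and the module only gradually learns to deviate from it.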

Appendix B MCM with Stable Diffusion

We provide additional examples of applying MCM to Stable Diffusion v2.1 using DDIM with 200 steps and an unconditional guidance scale of 5.0 in Figure 10. We also experiment with varying the artistic styles through keywords in the text input to SD in Figures 11, 12 and 13.
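The guidance scale mentioned above refers to classifier-free guidance. The sketch below shows the usual way a conditional and an unconditional noise prediction are combined under a guidance scale. How MCM's modulation is combined with this step is not specified in this section and is not shown; the function name and toy arrays are illustrative only.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=5.0):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy predictions with a latent-like layout (height, width, channels).
eps_uncond = np.zeros((8, 8, 4))
eps_cond = np.ones((8, 8, 4))

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=5.0)
unguided = classifier_free_guidance(eps_uncond, eps_cond, scale=1.0)
```

A scale of 1 recovers the conditional prediction, and larger values push the sample further toward the text condition.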

Appendix C Sampling methods

Since MCM is trained to predict modulation parameters for each timestep independently, we are flexible in the choice of sampling technique. We provide examples using DDPM (Ho et al., 2020) with all 1000 steps (Figures 17, 18 and 19 for CelebA and Figure 21 for Mountains) as well as additional results using DDIM (Song et al., 2020) with 200 uniformly sampled steps (Figures 14, 15 and 16 for CelebA and Figure 20 for Mountains).
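Because the module is conditioned on the timestep itself, a sampler only needs to decide which timesteps to visit. A minimal sketch of selecting 200 uniformly spaced steps out of 1000 training steps follows, assuming a constant stride starting at step 0; the exact offset convention differs between DDIM implementations, and `uniform_ddim_timesteps` is a hypothetical helper.

```python
def uniform_ddim_timesteps(num_train_steps=1000, num_sample_steps=200):
    """Return uniformly spaced timesteps in ascending order.

    Assumption: num_train_steps is divisible by num_sample_steps, so a
    constant stride yields exactly num_sample_steps entries.
    """
    if num_train_steps % num_sample_steps != 0:
        raise ValueError("num_train_steps must be divisible by num_sample_steps")
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))

timesteps = uniform_ddim_timesteps()

# Sampling walks from high noise to low noise, so the schedule is visited in reverse.
sampling_order = timesteps[::-1]
```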

We omit results using PLMS (Liu et al., 2022a) because of the high similarity to the samples produced with DDIM when using the same input noise $z_{T}$ .

Appendix D Effect of dataset size on MCM and ControlNet

In Tables 6 and 7, we further reduce the number of training examples and compare against ControlNet (Zhang and Agrawala, 2023). ControlNet is concurrent work that aims to add new conditioning modalities to pretrained diffusion models by training a copy of the diffusion model’s weights. The “trainable copy” is used to modulate the features of the original “locked copy”, and thus can be seen as a variant of our approach with direct access to the diffusion model and better initialization. We modify ControlNet similarly to MCM in order to accommodate multimodal synthesis (e.g., concatenating modalities, using modality dropout). We compare the best overall checkpoint of each ControlNet to MCM, all trained for the same number of epochs, and find that the two methods perform comparably even though ControlNet has significantly more trainable parameters. We observe that reducing the number of training examples for ControlNet leads to both poorer quality and alignment. Meanwhile, reducing the training examples for MCM produces more photorealistic and diverse images, but the images have poorer alignment to the input conditions.

Appendix E MCM with pixel-based diffusion models

We apply MCM to a pixel-based diffusion model in Figure 22. We use a public checkpoint for an unconditional model trained on CelebA at $64\times 64$ resolution (https://github.com/ermongroup/ddim). We use the same architecture and setup as described in Section 4 with one small modification: before applying Equation 4 to the predicted denoised image $x_{0}^{\prime}$, we use static thresholding on $x_{0}^{\prime}$ by clipping the values to $[-1,1]$.
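Static thresholding can be sketched as follows. The denoised estimate is computed with the standard DDPM relation between $x_{t}$, the predicted noise, and the cumulative noise schedule $\bar{\alpha}_{t}$, then clipped to $[-1,1]$. The function names and toy values are illustrative, and the subsequent modulation step from the paper is not shown.

```python
import numpy as np

def predict_x0(x_t, eps, alpha_bar_t):
    """Standard DDPM estimate of the clean image from x_t and the predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def static_threshold(x0):
    """Clip the denoised estimate to the valid pixel range [-1, 1]."""
    return np.clip(x0, -1.0, 1.0)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3
x0_true = rng.uniform(-1.0, 1.0, size=(64, 64, 3))
eps_true = rng.normal(size=(64, 64, 3))

# Forward process: noise the clean image to timestep t.
x_t = np.sqrt(alpha_bar_t) * x0_true + np.sqrt(1.0 - alpha_bar_t) * eps_true

# An imperfect noise prediction can push the estimate outside the pixel range.
eps_pred = eps_true + 0.5 * rng.normal(size=eps_true.shape)
x0_pred = static_threshold(predict_x0(x_t, eps_pred, alpha_bar_t))
```

Clipping matters mainly for pixel-space models, where values outside $[-1,1]$ do not correspond to valid images.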

Bibliography

Avrahami et al. (2022) Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. 2022. SpaText: Spatio-Textual Representation for Controllable Image Generation. arXiv preprint arXiv:2211.14305 (2022).
Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. 2022. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022).
Benenson et al. (2019) Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. 2019. Large-scale interactive object segmentation with human annotators. In CVPR.
Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 834–848.
Choi et al. (2022) Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. 2022. Perception Prioritized Training of Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11472–11481.
Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12873–12883.