# Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

**Authors:** Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz

arXiv: 2302.12764 · 2023-05-22

## TL;DR

This paper introduces multimodal conditioning modules (MCM) that enable control over image synthesis in pretrained diffusion models without updating their parameters, allowing for efficient, flexible, and precise multimodal image generation.

## Contribution

The paper proposes a lightweight module that modulates the predictions of a pretrained diffusion model for multimodal conditioning, without updating any of the diffusion network's parameters.
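To make the idea of modulating a frozen network's prediction concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the scale-and-shift form, the linear per-pixel layers, and the names `frozen_denoiser`, `ModulationModule`, and `modulated_prediction` are all assumptions made for illustration. The sketch only shows that a small, separate module can adjust a noise prediction using a 2D conditioning map while the base model stays untouched.

```python
import numpy as np


def frozen_denoiser(x_t, t):
    """Stand-in for a pretrained noise predictor (hypothetical toy function).

    It has no trainable state here, mirroring a base model that is never updated.
    """
    return 0.5 * x_t + 0.01 * t


class ModulationModule:
    """Hypothetical small module producing a per-pixel scale and shift.

    Assumption: the module sees the noisy image, the frozen prediction, and a
    2D conditioning map (e.g. a segmentation mask), all at the same resolution.
    """

    def __init__(self, channels, cond_channels):
        in_ch = 2 * channels + cond_channels
        # Zero initialisation: the module starts as an identity modulation.
        self.w_scale = np.zeros((in_ch, channels))
        self.w_shift = np.zeros((in_ch, channels))

    def num_params(self):
        return self.w_scale.size + self.w_shift.size

    def __call__(self, x_t, eps, cond):
        # Concatenate along the channel axis: (H, W, in_ch).
        feats = np.concatenate([x_t, eps, cond], axis=-1)
        scale = feats @ self.w_scale  # (H, W, channels)
        shift = feats @ self.w_shift  # (H, W, channels)
        return (1.0 + scale) * eps + shift


def modulated_prediction(x_t, t, cond, module):
    """One modulated prediction: frozen forward pass, then the small module."""
    eps = frozen_denoiser(x_t, t)  # base model output, treated as a constant
    return module(x_t, eps, cond)


rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 4, 3))   # toy noisy image, H x W x C
mask = np.zeros((4, 4, 1))
mask[:2] = 1.0                         # toy 2D condition: top half is "on"

module = ModulationModule(channels=3, cond_channels=1)
eps_mod = modulated_prediction(x_t, t=10, cond=mask, module=module)
```

In a training loop built on this sketch, only `w_scale` and `w_shift` would be optimised, so no gradients would need to flow through the base model. The zero initialisation is a design choice of the sketch, not something stated in the source: it makes the untrained module leave the pretrained prediction unchanged.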

## Key findings

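- MCM enables control over the spatial layout of generated images through 2D modalities such as semantic segmentation maps and sketches.
- Training MCM is cheap: it needs no gradients from the diffusion network, has only about 1% of the base model's parameters, and uses a limited number of training examples.
- On both unconditional and text-conditional models, MCM improves control over generated images and their alignment with the conditioning inputs.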

## Abstract

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but *does not require any updates to the diffusion network's parameters*. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only ~1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.12764/full.md

## Figures

28 figures with captions in the complete paper: https://tomesphere.com/paper/2302.12764/full.md

## References

62 references — full list in the complete paper: https://tomesphere.com/paper/2302.12764/full.md

---
Source: https://tomesphere.com/paper/2302.12764