Sketch-Guided Text-to-Image Diffusion Models
Andrey Voynov, Kfir Aberman, Daniel Cohen-Or

TL;DR
This paper introduces a universal method to guide pretrained text-to-image diffusion models at inference time using spatial maps such as sketches, without retraining the diffusion model itself, enabling flexible, out-of-domain control over generated images.
Contribution
It proposes a Latent Guidance Predictor (LGP), a small per-pixel MLP trained on a few thousand images, which steers a diffusion model with spatial maps without retraining the entire model.
Findings
Effective sketch-to-image translation with arbitrary styles.
Outperforms previous methods in out-of-domain sketch guidance.
Requires only a few thousand images for training the LGP.
Abstract
Text-to-Image models have introduced a remarkable leap in the evolution of machine learning, demonstrating high-quality synthesis of images from a given text-prompt. However, these powerful pretrained models still lack control handles that can guide spatial properties of the synthesized images. In this work, we introduce a universal approach to guide a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., sketch) during inference time. Unlike previous works, our method does not require training a dedicated model or a specialized encoder for the task. Our key idea is to train a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron (MLP) that maps latent features of noisy images to spatial maps, where the deep features are extracted from the core Denoising Diffusion Probabilistic Model (DDPM) network. The LGP is trained only on a…
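To make the "small, per-pixel MLP" idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the layer sizes, the ReLU activation, the random (untrained) weights, and the names init_mlp and predict_spatial_map are all assumptions made for illustration, and the feature tensor is random data standing in for deep features that would really be extracted from the denoising network.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a two-layer MLP (hypothetical sizes, untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, in_dim ** -0.5, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def predict_spatial_map(features, params):
    """Apply the same MLP independently at every pixel.

    features: (H, W, C) array of per-pixel feature vectors.
    returns:  (H, W, out_dim) predicted spatial map.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)  # one row per pixel
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU (assumed)
    out = hidden @ params["w2"] + params["b2"]
    return out.reshape(h, w, -1)

# Random stand-in for deep features of a noisy image (assumption, not real data).
rng = np.random.default_rng(1)
features = rng.normal(size=(8, 8, 16))
params = init_mlp(in_dim=16, hidden_dim=32, out_dim=1)
edge_map = predict_spatial_map(features, params)
```

Because the weights are shared across pixels and each pixel is processed on its own, the predictor has no spatial receptive field of its own; any spatial context has to come from the features it is given.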
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
Methods: Diffusion
