Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
Robin Rombach, Andreas Blattmann, Björn Ommer

TL;DR
This paper introduces retrieval-augmented diffusion models (RDMs) for artistic image synthesis. The model is conditioned on nearest neighbors retrieved from an external database during training; at inference, that database is replaced with a style-specific one, which gives control over the output style and outperforms traditional prompt-engineering methods.
Contribution
The authors propose a retrieval-augmented diffusion approach in which the visual style is specified by the choice of retrieval database rather than by a carefully composed text prompt, and which controls style more effectively than prompt-based methods.
Findings
Retrieval-augmented models outperform prompt-engineering in style control
Specialized databases improve style specificity during inference
Open-source code and models are provided for reproducibility
Abstract
Novel architectures have recently improved generative image synthesis, leading to excellent visual quality in various tasks. Of particular note is the field of "AI-Art", which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining language and image synthesis models, so-called "prompt-engineering" has become established, in which carefully selected and composed sentences are used to achieve a certain visual style in the synthesized image. In this note, we present an alternative approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set of nearest neighbors is retrieved from an external database during training for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains,…
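The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the three-dimensional vectors are made-up stand-ins for image embeddings (the paper's setting would use something like CLIP embeddings), the database names and entries are hypothetical, and the diffusion model that would be conditioned on the retrieved neighbors is left out entirely. The sketch only shows that the same query returns different conditioning samples once the database is swapped.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_neighbors(query: Vector, database: Dict[str, Vector], k: int) -> List[str]:
    """Return the names of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; real systems would use learned image embeddings.
train_database = {
    "photo_cat": [0.9, 0.1, 0.0],
    "photo_dog": [0.8, 0.2, 0.1],
    "photo_car": [0.1, 0.9, 0.0],
}
style_database = {
    "impressionist_cat": [0.9, 0.0, 0.4],
    "impressionist_field": [0.2, 0.3, 0.9],
    "impressionist_dog": [0.7, 0.2, 0.5],
}

query = [1.0, 0.0, 0.0]

# Training-time retrieval: neighbors come from the general database.
train_neighbors = retrieve_neighbors(query, train_database, k=2)

# Inference-time retrieval: same query, but the database has been swapped
# for a style-specific one, so the conditioning samples carry that style.
style_neighbors = retrieve_neighbors(query, style_database, k=2)

print(train_neighbors)
print(style_neighbors)
```

In this toy setup the model and the query stay fixed; only the database changes between the two calls, which is the mechanism the abstract relies on for style control.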
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · Contrastive Language-Image Pre-training
