Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations
Manoj Kumar, Neil Houlsby, Emiel Hoogeboom

TL;DR
This paper introduces Semantica, a diffusion model trained on web-scale image pairs for diverse and contextually consistent image variations, highlighting a new pretraining strategy and evaluation metrics.
Contribution
The paper presents a novel pretraining approach for diffusion models using web-scale image pairs to generate diverse image variations with semantic consistency.
Findings
Semantica can generate diverse variations from dataset images.
The model's performance depends on the choice of image encoder.
The proposed metrics evaluate image variations better than standard metrics do.
Abstract
Generating image variations, where a model produces variations of an input image while preserving the semantic context, has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model Semantica receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in…
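The pairing scheme described in the abstract (condition on one image from a webpage, denoise another image from the same webpage) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: images are flat lists of floats, `toy_encoder` is a hypothetical stand-in for the frozen image encoder, the cosine noise schedule and the noise-prediction target are common diffusion choices rather than the authors' confirmed setup, and the function names are invented.

```python
import math
import random


def sample_pair(pages, rng):
    """Pick a webpage with at least two images, then two distinct images from it."""
    eligible = [imgs for imgs in pages.values() if len(imgs) >= 2]
    images = rng.choice(eligible)
    cond, target = rng.sample(images, 2)  # sampled without replacement
    return cond, target


def add_noise(pixels, t, rng):
    """Forward-noise pixels at time t in [0, 1] with an assumed cosine schedule."""
    alpha = math.cos(0.5 * math.pi * t)  # signal scale
    sigma = math.sin(0.5 * math.pi * t)  # noise scale
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [alpha * x + sigma * e for x, e in zip(pixels, noise)]
    return noisy, noise


def toy_encoder(pixels):
    """Hypothetical frozen encoder: here just the mean pixel value."""
    return [sum(pixels) / len(pixels)]


def make_training_example(pages, encoder, rng):
    """Build one training example: an encoded conditioning image plus a noised target."""
    cond, target = sample_pair(pages, rng)
    t = rng.random()
    noisy, noise = add_noise(target, t, rng)
    return {
        "cond_embedding": encoder(cond),  # conditioning input to the denoiser
        "noisy_target": noisy,            # denoiser input
        "t": t,
        "noise": noise,                   # regression target under noise prediction
    }


if __name__ == "__main__":
    rng = random.Random(0)
    pages = {
        "page_a": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        "page_b": [[0.9, 0.9]],  # skipped: only one image on this page
    }
    print(make_training_example(pages, toy_encoder, rng))
```

Because the conditioning image and the target are different images from the same page, a denoiser trained on such examples cannot simply copy its conditioning input, which is the intuition behind expecting variations rather than reconstructions.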
Taxonomy
Topics: Scientific Computing and Data Management
Methods: Diffusion
