Controllable Image Generation via Collage Representations
Arantxa Casanova, Marlène Careil, Adriana Romero-Soriano, Christopher J. Pal, Jakob Verbeek, Michal Drozdzal

TL;DR
This paper introduces M&Ms, a collage-based generative model that enables fine-grained scene control by combining appearance features and spatial positions without relying on class labels. It outperforms baselines in scene controllability on OpenImages.
Contribution
The paper presents a new collage-based generative approach conditioned on visual scene descriptions, improving fine-grained controllability and generalization over prior text- and layout-based models.
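To make the idea of a collage concrete, here is a minimal sketch of how such a conditioning input could be represented as data. It assumes each object is described only by an appearance feature vector and a normalized bounding box, with no class label. The names `CollageElement`, `Collage`, `add`, and `covered_fraction` are hypothetical and chosen for illustration; they are not the paper's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x_min, y_min, x_max, y_max), normalized to [0, 1]
Box = Tuple[float, float, float, float]


@dataclass
class CollageElement:
    """One object in the collage (hypothetical structure, not the paper's API)."""
    appearance: List[float]  # appearance feature vector, e.g. from an image encoder
    box: Box                 # where the object should appear in the scene

    def __post_init__(self) -> None:
        x0, y0, x1, y1 = self.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box must lie in the unit square, got {self.box}")


@dataclass
class Collage:
    """A scene description: a list of elements, with no class labels anywhere."""
    elements: List[CollageElement] = field(default_factory=list)

    def add(self, appearance: List[float], box: Box) -> "Collage":
        self.elements.append(CollageElement(list(appearance), box))
        return self

    def covered_fraction(self) -> float:
        """Sum of box areas, a rough (overlap-ignoring) measure of scene coverage."""
        return sum((e.box[2] - e.box[0]) * (e.box[3] - e.box[1]) for e in self.elements)


# Two objects placed in opposite quadrants of the scene.
collage = (
    Collage()
    .add([0.2, 0.9, 0.1], (0.0, 0.5, 0.5, 1.0))
    .add([0.7, 0.1, 0.3], (0.5, 0.0, 1.0, 0.5))
)
```

In this sketch, swapping the appearance vector of an element changes what the object looks like while its box fixes where it goes, which is the kind of control that class labels alone cannot express.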
Findings
Outperforms baselines in scene controllability on OpenImages.
Outperforms DALL-E in zero-shot FID on MS-COCO.
Achieves high image quality and diversity with fewer parameters.
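For reference, the FID in the second finding is the Fréchet Inception Distance. Its standard definition, given here as general background rather than something stated on this page, fits a Gaussian to Inception features of real images (mean $\mu_r$, covariance $\Sigma_r$) and of generated images (mean $\mu_g$, covariance $\Sigma_g$) and measures the distance between the two fits:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that generated images are statistically closer to real ones.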
Abstract
Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality, by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning in combination with coarse semantic labels. The semantic labels, however, cannot be used to express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
