TL;DR
This paper introduces a weakly-supervised method for controllable generation of complex scenes: sparse semantic maps control object shapes and classes, while textual descriptions or attributes control local and global style.
Contribution
The paper presents a weakly-supervised framework that adds a semantic attention module for conditioning on text and a two-step generation scheme that separates background from foreground for finer scene control.
Findings
Reports better FID scores than fully-supervised settings when training on label maps produced by an object detector.
Demonstrates scene manipulation on complex datasets like COCO and Visual Genome.
Uses a large-vocabulary object detector to produce the training label maps, which gives access to unlabeled data and provides structured instance information.
Abstract
We propose a weakly-supervised approach for conditional image generation of complex scenes where a user has fine control over objects appearing in the scene. We exploit sparse semantic maps to control object shapes and classes, as well as textual descriptions or attributes to control both local and global style. In order to condition our model on textual descriptions, we introduce a semantic attention module whose computational cost is independent of the image resolution. To further augment the controllability of the scene, we propose a two-step generation scheme that decomposes background and foreground. The label maps used to train our model are produced by a large-vocabulary object detector, which enables access to unlabeled data and provides structured instance information. In such a setting, we report better FID scores compared to fully-supervised settings where the model is…
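The abstract states that the semantic attention module has a cost independent of image resolution but does not say how. One way this can be achieved, sketched below as an assumption rather than as the authors' implementation, is to compute attention once per object class against the words of the description and only then copy each class's result onto the pixels of the label map. All names here (`class_level_attention`, `broadcast_to_mask`, the embedding sizes) are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_level_attention(class_emb, word_emb):
    """Attend from each class to the words of a description.

    class_emb: (C, D) one embedding per object class (assumed input).
    word_emb:  (T, D) one embedding per word (assumed input).
    Returns (C, D) style vectors. The work is O(C * T * D) and never
    touches the image grid.
    """
    d = class_emb.shape[1]
    scores = class_emb @ word_emb.T / np.sqrt(d)  # (C, T)
    weights = softmax(scores, axis=1)             # each row sums to 1
    return weights @ word_emb                     # (C, D)

def broadcast_to_mask(style, label_map):
    """Copy each class's style vector onto its pixels: a plain lookup.

    style:     (C, D)
    label_map: (H, W) integer class ids in [0, C)
    Returns (H, W, D).
    """
    return style[label_map]

rng = np.random.default_rng(0)
C, T, D = 5, 7, 16
class_emb = rng.normal(size=(C, D))
word_emb = rng.normal(size=(T, D))

# The attention step runs once, whatever the output resolution is.
style = class_level_attention(class_emb, word_emb)

# The same style vectors are reused for a small and a large label map.
low = rng.integers(0, C, size=(8, 8))
high = rng.integers(0, C, size=(64, 64))
feat_low = broadcast_to_mask(style, low)
feat_high = broadcast_to_mask(style, high)
```

In this sketch only the final lookup scales with the number of pixels; the attention weights themselves depend on the number of classes and words alone.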
