TL;DR
InstanceGen combines image-based structural guidance with LLM-based instance-level instructions to generate images that follow complex prompts, including object counts, instance-level attributes, and spatial relations.
Contribution
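The paper observes that contemporary image generation models can themselves provide a plausible fine-grained structural initialization, and couples this image-based guidance with LLM-based instance-level instructions so that generated images adhere more closely to the prompt.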
Findings
Enhanced adherence to complex prompts with multiple objects and attributes
Better spatial and instance-level control in generated images
Outperforms existing methods in capturing detailed prompt semantics
Abstract
Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, there is growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by observing that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and…
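To make the idea of pairing instance-level instructions with a structural initialization more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not say how instructions are tied to structure, so the greedy label matching below is an assumption chosen for simplicity. The names `InstanceSpec`, `Segment`, and `assign_instances` are hypothetical, and the inputs stand in for what an LLM and a segmenter might return.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InstanceSpec:
    """One object instance requested by the prompt (hypothetical LLM output)."""
    label: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Segment:
    """One region found in an initial generated image (hypothetical segmenter output)."""
    label: str
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def assign_instances(specs, segments):
    """Greedily pair each requested instance with an unused segment of the same label.

    Returns (pairs, unmatched_specs, unused_segments). Unmatched specs signal
    that the initial image has too few objects of some class; unused segments
    signal extra objects. This matching rule is an illustrative assumption.
    """
    free = list(segments)
    pairs, unmatched = [], []
    for spec in specs:
        match = next((s for s in free if s.label == spec.label), None)
        if match is None:
            unmatched.append(spec)
        else:
            free.remove(match)
            pairs.append((spec, match))
    return pairs, unmatched, free


# Example prompt: "a black cat, a white cat, and a brown dog"
specs = [
    InstanceSpec("cat", ["black"]),
    InstanceSpec("cat", ["white"]),
    InstanceSpec("dog", ["brown"]),
]
# Example initial image: only one cat, one dog, and a stray bird.
segments = [
    Segment("cat", (0, 0, 50, 50)),
    Segment("dog", (60, 0, 120, 60)),
    Segment("bird", (0, 70, 20, 90)),
]

pairs, unmatched, unused = assign_instances(specs, segments)
for spec, seg in pairs:
    print(spec.attributes, spec.label, "->", seg.bbox)
print("missing:", [(s.attributes, s.label) for s in unmatched])
print("extra:", [s.label for s in unused])
```

In this toy case the second cat has no matching region and the bird has no matching request, which is the kind of count mismatch that instance-level guidance is meant to correct.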
