LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs
Jiaze Wang, Rui Chen, Haowang Cui

TL;DR
LLM_Control introduces a multimodal LLM-guided framework to improve spatial and semantic control in text-to-image diffusion models, enabling more precise and complex image synthesis from detailed prompts.
Contribution
This work presents a novel LLM-guided approach that enhances grounding and control in pre-trained diffusion models for complex T2I synthesis tasks.
Findings
Achieves quality competitive with state-of-the-art methods
Effectively handles complex spatial compositions and multiple objects
Improves adherence to control conditions in image generation
Abstract
Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to follow the control conditions precisely when generating images, especially for textual prompts that contain multiple objects or complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control accurately modulates the pre-trained diffusion models, so that visual conditions and textual prompts influence structure and appearance generation in a complementary way. We use a multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions, and bind object attributes. The obtained control signals are injected into…
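As a rough illustration of the three controller roles named in the abstract (arranging layouts, augmenting descriptions, binding attributes), the sketch below shows one way a layout plan could be represented and turned into per-object control signals. It is a minimal sketch under stated assumptions, not the paper's implementation: the names GroundedObject, box_to_mask, and build_control_signals, the normalized box format, and the hand-written plan standing in for LLM output are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroundedObject:
    """One object in a hypothetical layout plan (not the paper's actual format)."""
    name: str
    attributes: list[str] = field(default_factory=list)
    # Normalized box (x0, y0, x1, y1) in [0, 1], origin at the top-left corner.
    box: tuple[float, float, float, float] = (0.0, 0.0, 1.0, 1.0)

    def phrase(self) -> str:
        # Attribute binding: keep each attribute attached to its own object.
        return " ".join([*self.attributes, self.name])


def box_to_mask(box, height, width):
    """Rasterize a normalized box into a binary mask (list of rows of 0/1)."""
    x0, y0, x1, y1 = box
    c0, c1 = round(x0 * width), round(x1 * width)
    r0, r1 = round(y0 * height), round(y1 * height)
    return [
        [1 if r0 <= r < r1 and c0 <= c < c1 else 0 for c in range(width)]
        for r in range(height)
    ]


def build_control_signals(objects, height=8, width=8):
    """Pair each grounded phrase with the region it is meant to occupy."""
    return [(obj.phrase(), box_to_mask(obj.box, height, width)) for obj in objects]


# Hand-written stand-in (an assumption) for what an LLM planner might return
# for the prompt "a red apple to the left of a blue cup".
plan = [
    GroundedObject("apple", ["red"], (0.0, 0.25, 0.5, 0.75)),
    GroundedObject("cup", ["blue"], (0.5, 0.25, 1.0, 0.75)),
]
signals = build_control_signals(plan)
```

Keeping each phrase paired with its own region is one simple way to express the idea that text governs appearance while the layout governs structure. How the actual method injects such signals into the diffusion model is not stated in the truncated abstract and is not modeled here.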
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
