LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs

Jiaze Wang; Rui Chen; Haowang Cui

arXiv:2507.19939·cs.CV·July 29, 2025

Open Access

TL;DR

LLM_Control is a multimodal LLM-guided framework that improves spatial and semantic control in text-to-image diffusion models, enabling more precise image synthesis from detailed prompts with complex compositions.

Contribution

This work presents a novel LLM-guided approach that enhances grounding and control in pre-trained diffusion models for complex T2I synthesis tasks.

Findings

01. Achieves quality competitive with state-of-the-art methods

02. Handles complex spatial compositions and multiple objects effectively

03. Improves adherence to control conditions in image generation

Abstract

Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to follow the control conditions precisely and generate the corresponding images, especially when the textual prompts contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control accurately modulates the pre-trained diffusion models, with visual conditions and textual prompts influencing structure and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes. The obtained control signals are injected into…
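
To make the controller's three roles concrete (arranging layouts, augmenting descriptions, binding attributes), here is a minimal Python sketch of what a set of per-object control signals might look like. It is an illustration only, not the authors' implementation: the names `ObjectSpec`, `validate_layout`, `box_to_mask`, and `build_control_signals` are hypothetical, the LLM call itself is omitted, and the abstract does not say how the signals are represented or injected.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectSpec:
    """One object a (hypothetical) LLM controller has placed in the scene."""

    name: str
    # Attributes bound to this object only, e.g. ["red", "wooden"].
    attributes: list[str] = field(default_factory=list)
    # Normalized box (x0, y0, x1, y1) inside the unit square.
    box: tuple[float, float, float, float] = (0.0, 0.0, 1.0, 1.0)

    def phrase(self) -> str:
        """Join attributes and name into one phrase, e.g. 'red wooden chair'."""
        return " ".join([*self.attributes, self.name])


def validate_layout(objects: list[ObjectSpec]) -> None:
    """Raise ValueError if any box is degenerate or leaves the unit square."""
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {obj.name!r}: {obj.box}")


def box_to_mask(
    box: tuple[float, float, float, float], height: int, width: int
) -> list[list[int]]:
    """Rasterize a normalized box into a binary mask (rows of 0/1)."""
    x0, y0, x1, y1 = box
    c0, c1 = round(x0 * width), round(x1 * width)
    r0, r1 = round(y0 * height), round(y1 * height)
    return [
        [1 if (r0 <= r < r1 and c0 <= c < c1) else 0 for c in range(width)]
        for r in range(height)
    ]


def build_control_signals(
    objects: list[ObjectSpec], height: int = 8, width: int = 8
) -> list[dict]:
    """Pair each object's attribute-bound phrase with its region mask.

    Assumption: a downstream diffusion model would consume these pairs;
    how that happens is not described in the abstract.
    """
    validate_layout(objects)
    return [
        {"phrase": obj.phrase(), "mask": box_to_mask(obj.box, height, width)}
        for obj in objects
    ]
```

For example, `build_control_signals([ObjectSpec("chair", ["red", "wooden"], (0.0, 0.5, 0.5, 1.0))], height=4, width=4)` pairs the phrase "red wooden chair" with a mask covering the lower-left quadrant. Keeping attributes on the object record, rather than in one flat prompt, is one simple way to avoid attributes leaking between objects.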

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis