DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation
Yuhao Jia, Wenhan Tan

TL;DR
DivCon introduces a divide-and-conquer method for text-to-image generation that enhances numerical and spatial reasoning, enabling lightweight models to produce more accurate and complex images from detailed prompts.
Contribution
The paper presents a novel divide-and-conquer framework that decouples layout prediction and image synthesis, improving scalability and performance in complex T2I tasks.
Findings
Outperforms previous methods on HRS and NSR-1K benchmarks.
Achieves layout accuracy comparable to large-scale LLMs while using lightweight LLMs.
Significantly improves perceptual quality in complex multi-object image generation.
Abstract
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. To further improve T2I models' capability in numerical and spatial reasoning, layout is employed as an intermediate representation to bridge large language models and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, limiting accessibility and scalability. They also struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image…
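To make the two-step layout prediction concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: in the paper both steps are carried out by an LLM, whereas here the reasoning output (object counts and one spatial relation) is supplied by hand and the planning step is a simple hypothetical rule that lines boxes up left to right. The names `Box` and `plan_boxes`, the 512-pixel canvas, and the relation vocabulary are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def plan_boxes(counts, relation=None, canvas=512, margin=16):
    """Toy stand-in for the bounding box planning step (hypothetical rule).

    counts:   dict mapping object label -> number of instances, i.e. the
              assumed output of the numerical reasoning step.
    relation: optional (subject, "left of" | "right of", object) triple,
              i.e. the assumed output of the spatial reasoning step.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one object is required")

    labels = list(counts)
    if relation is not None:
        subj, rel, obj = relation
        if rel not in ("left of", "right of"):
            raise ValueError(f"unsupported relation: {rel}")
        # Order the two related groups so the relation holds left to right.
        first, second = (subj, obj) if rel == "left of" else (obj, subj)
        labels = [first, second] + [l for l in labels if l not in (first, second)]

    # Give every instance an equal-width horizontal slot on the canvas.
    slot = (canvas - 2 * margin) / total
    boxes = []
    i = 0
    for label in labels:
        for _ in range(counts[label]):
            x0 = margin + i * slot
            boxes.append(Box(label, x0, canvas / 4, x0 + 0.9 * slot, 3 * canvas / 4))
            i += 1
    return boxes


# Prompt idea: "two dogs to the left of a cat" (reasoning output written by hand).
boxes = plan_boxes({"cat": 1, "dog": 2}, ("dog", "left of", "cat"))
for b in boxes:
    print(b)
```

Separating the counts and relations from the box coordinates means each step can be checked on its own: the number of boxes against the counts, and the box positions against the relation.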
Taxonomy
Topics: Video Analysis and Summarization
Methods: Diffusion
