DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

Yuhao Jia; Wenhan Tan

arXiv:2403.06400 · cs.CV · March 10, 2026 · 1 citation

TL;DR

DivCon introduces a divide-and-conquer method for text-to-image generation that enhances numerical and spatial reasoning, enabling lightweight models to produce more accurate and complex images from detailed prompts.

Contribution

The paper presents a novel divide-and-conquer framework that decouples layout prediction and image synthesis, improving scalability and performance in complex T2I tasks.

Findings

1. Outperforms previous methods on HRS and NSR-1K benchmarks.
2. Achieves comparable layout accuracy with lightweight LLMs.
3. Significantly improves perceptual quality in complex multi-object image generation.

Abstract

Diffusion-driven text-to-image (T2I) generation has advanced remarkably in recent years. To further improve T2I models' numerical and spatial reasoning, layout is employed as an intermediate representation that bridges large language models and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, which limits accessibility and scalability. They also struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image…
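The split of layout prediction into a reasoning step and a box-planning step can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`reason_stub`, `plan_boxes`), the `Box` type, the plan format, and the placement rule are all hypothetical, and the reasoning step is a hard-coded stand-in for what the abstract describes as an LLM call.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x: float  # left edge on a unit-width canvas
    y: float  # top edge on a unit-height canvas
    w: float
    h: float


def reason_stub(prompt: str) -> dict:
    """Step 1 (hypothetical): numerical and spatial reasoning.

    A real system would ask an LLM to extract counts and relations.
    Here the answer for one example prompt is hard-coded.
    """
    if prompt == "two cats to the left of three dogs":
        return {
            "counts": {"cat": 2, "dog": 3},
            "relations": [("cat", "left of", "dog")],
        }
    raise ValueError("this toy stub only knows one prompt")


def plan_boxes(plan: dict) -> list[Box]:
    """Step 2 (hypothetical): turn counts and relations into boxes."""
    # Every category starts with the full canvas width as its region.
    regions = {name: (0.0, 1.0) for name in plan["counts"]}
    for subj, rel, obj in plan["relations"]:
        if rel == "left of":
            regions[subj], regions[obj] = (0.0, 0.5), (0.5, 1.0)
        elif rel == "right of":
            regions[subj], regions[obj] = (0.5, 1.0), (0.0, 0.5)

    boxes = []
    for name, n in plan["counts"].items():
        if n <= 0:
            continue
        x0, x1 = regions[name]
        cell = (x1 - x0) / n  # equal-width cell per instance
        for i in range(n):
            boxes.append(Box(name, x0 + i * cell, 0.25, cell, 0.5))
    return boxes


boxes = plan_boxes(reason_stub("two cats to the left of three dogs"))
```

Separating the two steps means the count and relation constraints are fixed before any coordinates are chosen, so the box planner only has to satisfy an explicit plan rather than re-read the prompt.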

Taxonomy

Topics: Video Analysis and Summarization

Methods: Diffusion