ComposeAnything: Composite Object Priors for Text-to-Image Generation

Zeeshan Khan; Shizhe Chen; Cordelia Schmid

arXiv:2505.24086 · cs.CV · June 2, 2025

Open Access

TL;DR

ComposeAnything is a framework that uses large language models to generate 2.5D layouts from text (2D object bounding boxes with depth information and detailed captions). These layouts guide diffusion models to produce complex, coherent images with more faithful object arrangements, without retraining existing models.

Contribution

The paper presents a method that uses LLMs to generate 2.5D layouts, improving compositional image synthesis in text-to-image models without retraining them.

Findings

01 Outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks.

02 Produces high-quality images whose compositions are faithful to the text prompts.

03 Enables generation of complex and surreal object arrangements.

Abstract

Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial- and depth-aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic…
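
To make the idea of a 2.5D layout concrete, the following is a minimal sketch, not the authors' implementation. It assumes a layout is a list of objects, each with a caption, a normalized 2D bounding box, and a scalar depth, and it assumes that a larger depth value means farther from the camera. The names `LayoutObject`, `paint_order`, and `coarse_id_map` are hypothetical. The sketch only shows how depth can resolve occlusion when boxes overlap: objects are painted far to near, so nearer objects overwrite farther ones.

```python
from dataclasses import dataclass


@dataclass
class LayoutObject:
    """One entry of a hypothetical 2.5D layout."""
    caption: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
    depth: float  # assumption: larger value = farther from the camera


def paint_order(objects):
    """Return (index, object) pairs sorted far to near (painter's algorithm)."""
    return sorted(enumerate(objects), key=lambda pair: pair[1].depth, reverse=True)


def coarse_id_map(objects, width, height):
    """Rasterize boxes onto a grid of object indices; -1 marks background.

    Far objects are drawn first, so nearer objects overwrite them
    wherever their boxes overlap.
    """
    grid = [[-1] * width for _ in range(height)]
    for index, obj in paint_order(objects):
        x0, y0, x1, y1 = obj.box
        for row in range(int(y0 * height), int(y1 * height)):
            for col in range(int(x0 * width), int(x1 * width)):
                grid[row][col] = index
    return grid


# Illustrative layout: an apple in front of a table (values are made up).
layout = [
    LayoutObject("a red apple", (0.25, 0.25, 0.75, 0.75), depth=1.0),
    LayoutObject("a wooden table", (0.0, 0.5, 1.0, 1.0), depth=2.0),
]
id_map = coarse_id_map(layout, width=4, height=4)
for row in id_map:
    print(row)
```

In this toy layout the apple (index 0) occludes the table (index 1) where the two boxes overlap, which a purely 2D layout without depth could not decide.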

Taxonomy

Topics: Human Motion and Animation · Augmented Reality Applications · Video Analysis and Summarization

Methods: Attentive Walk-Aggregating Graph Neural Network