TL;DR
This paper introduces comprehensive benchmarks and a unified evaluation protocol for assessing layout-guided text-to-image diffusion models, focusing on semantic and spatial alignment in both controlled and real-world scenarios.
Contribution
The paper presents two new benchmarks, C-Bench and O-Bench, and a unified evaluation protocol for systematically comparing layout-guided diffusion models.
Findings
A large-scale evaluation of six state-of-the-art models was conducted.
Model rankings are provided based on overall performance and detailed alignment analysis.
The analysis yields fine-grained insights into the strengths and limitations of current models.
Abstract
Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying levels of complexity in both prompt structure and layout. To complement this controlled setting, we propose an open-set benchmark (O-Bench) that evaluates models using real-world prompts and layouts, offering a measure of semantic and spatial alignment in the wild. We further develop a unified…
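To make "spatial fidelity to prescribed layouts" concrete, here is a minimal sketch of one common way such alignment can be scored: intersection-over-union (IoU) between each prescribed bounding box and the box where the object was actually detected in the generated image. This is an illustrative assumption, not the metric defined by the paper. The truncated abstract does not specify its protocol, and the names `iou` and `layout_alignment`, the `(x_min, y_min, x_max, y_max)` box format, and the index-based pairing of boxes are all hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixel or normalized units.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_alignment(prescribed: Sequence[Box], detected: Sequence[Box]) -> float:
    """Mean IoU over index-matched (prescribed, detected) box pairs.

    Assumption: detected[i] is the detector's box for the object that
    prescribed[i] asked for. A real protocol would also need to handle
    missing or extra detections.
    """
    if len(prescribed) != len(detected):
        raise ValueError("prescribed and detected must have the same length")
    if not prescribed:
        return 0.0
    return sum(iou(p, d) for p, d in zip(prescribed, detected)) / len(prescribed)


if __name__ == "__main__":
    layout = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    found = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
    print(layout_alignment(layout, found))
```

A score like this covers only the spatial side. Semantic alignment with the text prompt would need a separate measure, which is why the abstract treats the two as distinct requirements.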
