TL;DR
This paper investigates when diffusion models learn to generate multiple objects, revealing that scene complexity and data composition significantly impact their multi-object generation capabilities.
Contribution
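The study introduces a controlled framework for dataset generation and systematically analyzes how properties of the training data (dataset size, concept imbalance, and held-out concept combinations) affect multi-object scene generation in diffusion models.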
Findings
Scene complexity affects diffusion model performance more than concept imbalance.
Counting objects is particularly challenging in low-data regimes.
Compositional generalization deteriorates as more concept combinations are excluded during training.
Abstract
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that…
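To make the compositional generalization regime concrete, here is a minimal Python sketch of a held-out split over concept combinations. It is an illustration under stated assumptions, not the paper's actual dataset code: the concept lists, the function name compositional_split, and the holdout_fraction parameter are all hypothetical. The split withholds some (color, shape) pairs while ensuring every individual concept still appears in training, so only the combinations are unseen.

```python
import itertools
import random


def compositional_split(colors, shapes, holdout_fraction, seed=0):
    """Split all (color, shape) pairs into training and held-out combinations.

    Hypothetical helper: a pair is only held out if its color and its shape
    each still occur in some remaining training pair, so every individual
    concept is observed during training.
    """
    rng = random.Random(seed)
    pairs = list(itertools.product(colors, shapes))
    rng.shuffle(pairs)

    n_holdout = int(len(pairs) * holdout_fraction)
    train = list(pairs)
    held_out = []

    for pair in pairs:
        if len(held_out) == n_holdout:
            break
        remaining = [p for p in train if p != pair]
        color_seen = any(p[0] == pair[0] for p in remaining)
        shape_seen = any(p[1] == pair[1] for p in remaining)
        # Hold the pair out only if both of its concepts stay covered.
        if color_seen and shape_seen:
            held_out.append(pair)
            train = remaining

    return train, held_out


# Assumed toy concepts, chosen only for illustration.
colors = ["red", "green", "blue"]
shapes = ["cube", "sphere", "cone"]
train_pairs, held_out_pairs = compositional_split(colors, shapes, holdout_fraction=0.3)
```

Raising holdout_fraction in this sketch corresponds to excluding more concept combinations from training, which is the axis along which the findings report that compositional generalization deteriorates.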
