When Do Diffusion Models Learn to Generate Multiple Objects?

Yujin Jeong; Arnas Uselis; Iro Laina; Seong Joon Oh; Anna Rohrbach

arXiv:2605.00273·cs.CV·May 4, 2026

TL;DR

This paper investigates when diffusion models learn to generate multiple objects. It finds that scene complexity, more than concept imbalance, limits multi-object generation, and that held-out concept combinations and small datasets make the problem worse.

Contribution

The study introduces MOSAIC, a controlled framework for dataset generation, and uses it to analyze how dataset size, concept imbalance, and held-out concept combinations affect multi-object scene generation in diffusion models.

Findings

01  Scene complexity affects diffusion model performance more than concept imbalance.

02  Counting objects is particularly challenging in low-data regimes.

03  Compositional generalization deteriorates as more concept combinations are excluded during training.

Abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce MOSAIC (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on MOSAIC, we find that scene complexity plays a dominant role rather than concept imbalance, and that…
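
To make the compositional generalization regime concrete, here is a minimal sketch of a hold-out split over concept combinations. It is not the authors' code or the framework's API: the vocabularies `OBJECTS`, `COLORS`, and `COUNTS` and the function `compositional_split` are hypothetical names chosen for illustration. The sketch assumes that every individual concept must still appear somewhere in training, so that only the combinations are unseen.

```python
import itertools
import random

# Hypothetical concept vocabularies, chosen only for illustration.
OBJECTS = ["cube", "sphere", "cone"]
COLORS = ["red", "green", "blue"]
COUNTS = [1, 2, 3]


def compositional_split(holdout_fraction, seed=0):
    """Hold out a fraction of (object, color, count) combinations.

    A combination is held out only if each of its concepts still appears
    in at least one remaining training combination, so the model sees every
    individual concept but not every pairing.
    """
    combos = list(itertools.product(OBJECTS, COLORS, COUNTS))
    rng = random.Random(seed)
    rng.shuffle(combos)

    target = int(len(combos) * holdout_fraction)
    train = list(combos)
    held_out = []
    for combo in combos:
        if len(held_out) >= target:
            break
        remaining = [c for c in train if c != combo]
        # Each concept slot of the candidate must stay covered in training.
        covered = all(
            any(c[i] == combo[i] for c in remaining) for i in range(len(combo))
        )
        if covered:
            train = remaining
            held_out.append(combo)
    return train, held_out


if __name__ == "__main__":
    train, held_out = compositional_split(0.3, seed=1)
    print(len(train), "training combinations,", len(held_out), "held out")
```

Raising `holdout_fraction` mimics excluding more concept combinations during training. The concept generalization regime would instead keep all combinations and vary how often each concept is sampled.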

Code & Models

Repositories

eugene6923/MOSAIC (GitHub)
