Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

TL;DR
This paper introduces Generate Any Scene, a data engine that creates diverse, complex scene graphs for training visual generation models, improving their compositional understanding and semantic alignment through synthetic data and iterative self-improvement.
Contribution
The paper presents a novel scene graph-based data synthesis framework that enhances visual generation models' performance and semantic accuracy, including a self-improving training loop, a distillation method, and a low-cost reward model.
Findings
Stable Diffusion v1.5 improves by 4% with generated data.
Fewer than 800 synthetic captions boost compositional generation by 10%.
The reward model surpasses CLIP-based methods by 5%.
Abstract
Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate…
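The pipeline the abstract describes (sample a scene graph from a taxonomy, then turn it into a caption and a set of question-answer pairs) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the toy taxonomy, the `SceneGraph` class, the function names, and the caption and question templates are hypothetical and are not taken from the paper or its code.

```python
import random
from dataclasses import dataclass

# Hypothetical toy taxonomy; the paper's actual metadata is not shown here.
OBJECTS = ["dog", "table", "lamp", "bicycle"]
ATTRIBUTES = ["red", "wooden", "small", "shiny"]
RELATIONS = ["next to", "on top of", "behind"]


@dataclass
class SceneGraph:
    objects: list      # object names; list index serves as node id
    attributes: dict   # node id -> list of attribute strings
    relations: list    # (subject id, relation, object id) triples


def sample_scene_graph(num_objects, num_relations, rng):
    """Sample a graph whose complexity is set by the two counts."""
    objects = rng.sample(OBJECTS, num_objects)  # distinct objects
    attributes = {i: [rng.choice(ATTRIBUTES)] for i in range(num_objects)}
    relations = []
    for _ in range(num_relations):
        # Two distinct node ids, so no object relates to itself.
        subj, obj = rng.sample(range(num_objects), 2)
        relations.append((subj, rng.choice(RELATIONS), obj))
    return SceneGraph(objects, attributes, relations)


def describe(graph, i):
    """Render one node as 'attribute ... name'."""
    return " ".join(graph.attributes[i] + [graph.objects[i]])


def to_caption(graph):
    """Assumed template: one clause per relation, then unrelated objects."""
    clauses = [
        f"a {describe(graph, s)} {rel} a {describe(graph, o)}"
        for s, rel, o in graph.relations
    ]
    related = {i for s, _, o in graph.relations for i in (s, o)}
    clauses += [
        f"a {describe(graph, i)}"
        for i in range(len(graph.objects))
        if i not in related
    ]
    return ", and ".join(clauses)


def to_questions(graph):
    """Assumed template: one yes/no question per object, attribute, relation."""
    qa = []
    for i, name in enumerate(graph.objects):
        qa.append((f"Is there a {name}?", "yes"))
        for attr in graph.attributes[i]:
            qa.append((f"Is the {name} {attr}?", "yes"))
    for s, rel, o in graph.relations:
        qa.append((f"Is the {graph.objects[s]} {rel} the {graph.objects[o]}?", "yes"))
    return qa


graph = sample_scene_graph(num_objects=3, num_relations=2, rng=random.Random(0))
caption = to_caption(graph)
questions = to_questions(graph)
```

In this sketch the caption would serve as the generation prompt, and the fraction of questions a VQA model answers correctly on the generated image could act as a semantic-alignment score; how the paper actually computes its score is not specified in this summary.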
Peer Reviews
Decision: ICLR 2026 Poster
This paper presents a useful data generation pipeline based on scene graphs that can be incorporated into model self-improvement pipelines for image and video generation. The paper provides extensive experimental results and the proposed approach is technically sound. Given the reported results this approach can be useful for researchers working across this field.
My main concerns with this paper are as follows:
* Technical novelty: the paper contributes very little technical novelty as its main contribution is the provision of components for scene graph sampling. It would have been interesting to see (at least a discussion of) different, more advanced sampling methods from the data repository to further improve downstream task performance.
* Limitations: the paper does not seem to discuss limitations of their proposed approach. It would be great if addit
1. The paper identifies a clear and significant problem in text-to-vision generation: the lack of structured, compositionally rich training data. The proposed solution using a programmatic scene graph engine to systematically enumerate the visual space is well-motivated.
2. The paper provides extensive experimental validation across multiple tasks (text-to-image, video, 3D), models, and benchmarks (DPG-Bench, GenEval, GenAI-Bench).
3. A significant strength is the demonstration of the system'
1. I am concerned that randomly sampling metadata (objects, attributes, relations) may result in implausible or nonsensical scenes (e.g., "crispy dog holding a rabbit reciting a poem"). I am not sure whether the "commonsense plausibility filtering" mentioned in Section 2.1 addresses this problem; it is not clearly described, making it difficult to assess its effectiveness. For scalable, fully automated operation, this remains a potential source of noise and inefficiency, as the T2I model and evaluator
1. The paper is well written and easy to follow. Multiple tasks are tackled with the proposed approach, demonstrating the broad applicability.
2. Extensive experiments are performed, trying to support the claims of the paper.
1. It is understandable that the proposed data engine can generate faithfully compositional captions. However, it is not convincing that such an engine can improve the compositional generation performance of a text-to-image generative model. Once a text-to-image generative model is trained, its distribution is fixed. When the data engine and VQA score are used to select self-generated images and finetune the model, this just biases the model to overfit to regions of high VQA score. Then of course, the
Taxonomy
Topics: Educational Games and Gamification · Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · ALIGN
