SceneForge: Structured World Supervision from 3D Interventions
Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

TL;DR
SceneForge introduces a framework that generates consistent, structured supervision from editable 3D scene states, enhancing multimodal learning with counterfactual and multi-view data.
Contribution
The paper presents SceneForge, an intervention-driven method that produces scalable, scene-aware supervision from 3D world representations for multimodal learning.
Findings
Improves object and scene removal performance across benchmarks.
Produces aligned multi-view and counterfactual observations from shared scene states.
Demonstrates scalability with over 2,000 scenes and diverse settings.
Abstract
Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as…
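The intervene-then-propagate idea in the abstract can be illustrated with a toy sketch. This is not the SceneForge implementation: the `SceneState` class, the `depends_on` map, the `remove_object` and `observe` functions, and the example entity names are all hypothetical, and the "renderer" here only lists entity names per camera instead of producing images. The sketch assumes a scene is a set of named entities plus dependency edges (for example, a shadow that exists only because of its caster), and that removing an object should also remove everything that transitively depends on it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneState:
    """Hypothetical persistent scene state: entities plus dependency edges."""
    objects: set[str]
    # Maps a dependent entity to the entities it needs in order to exist,
    # e.g. {"mug_shadow": {"mug"}}. Assumed structure, for illustration only.
    depends_on: dict[str, set[str]] = field(default_factory=dict)


def remove_object(scene: SceneState, target: str) -> tuple[SceneState, set[str]]:
    """Remove `target` and every entity that transitively depends on it.

    Returns the counterfactual scene and the set of removed entities.
    The input scene is left unchanged.
    """
    if target not in scene.objects:
        raise KeyError(f"unknown object: {target}")
    removed = {target}
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for child, parents in scene.depends_on.items():
            if current in parents and child not in removed:
                removed.add(child)
                frontier.append(child)
    remaining = scene.objects - removed
    deps = {c: set(p) for c, p in scene.depends_on.items() if c in remaining}
    return SceneState(remaining, deps), removed


def observe(scene: SceneState, cameras: list[str]) -> dict[str, list[str]]:
    """Stand-in for rendering: one sorted entity list per camera."""
    return {camera: sorted(scene.objects) for camera in cameras}


scene = SceneState(
    objects={"table", "lamp", "mug", "mug_shadow"},
    depends_on={"mug_shadow": {"mug"}},
)
counterfactual, removed = remove_object(scene, "mug")

cameras = ["front", "top"]
factual_views = observe(scene, cameras)
counterfactual_views = observe(counterfactual, cameras)
```

Because the factual and counterfactual observations are derived from the same scene state and the same camera list, they stay aligned per view, and the `removed` set plays the role of an effect-aware signal (the object together with its dependent effects).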
