Generative Blocks World: Moving Things Around in Pictures
Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, D.A. Forsyth, and Anand Bhattad

TL;DR
This paper introduces Generative Blocks World, a method for editing generated images by manipulating 3D primitives, enabling accurate, consistent, and flexible scene modifications with improved visual quality.
Contribution
It presents a novel scene representation built from convex 3D primitives, together with a flow-based image generator conditioned on depth and texture hints, improving editability and visual fidelity.
Findings
Outperforms prior methods in visual fidelity
Enables accurate object and camera movements
Preserves object identity during edits
Abstract
We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method, which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding the texture-consistency provided by existing techniques. These texture hints (a) allow accurate object and camera moves and (b) preserve the identity of objects. Our experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.
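To make the geometric side of the abstract concrete, the following is a minimal sketch of how a convex primitive could be edited and turned into a depth map for conditioning. It is not the authors' implementation. The half-space representation, the pinhole camera at the origin, and the helper names `box`, `translate`, and `render_depth` are all assumptions made for illustration, and the texture-hint step is not modeled here.

```python
import numpy as np

def box(center, half_size):
    """Axis-aligned box as six half-spaces (n, d), meaning n . x <= d (hypothetical helper)."""
    c = np.asarray(center, dtype=float)
    h = np.asarray(half_size, dtype=float)
    planes = []
    for axis in range(3):
        n = np.zeros(3)
        n[axis] = 1.0
        planes.append((n, c[axis] + h[axis]))       # x_axis <= c + h
        planes.append((-n, -(c[axis] - h[axis])))   # x_axis >= c - h
    return planes

def translate(planes, offset):
    """Move a convex primitive: n . (x - offset) <= d becomes n . x <= d + n . offset."""
    offset = np.asarray(offset, dtype=float)
    return [(n, d + float(n @ offset)) for n, d in planes]

def render_depth(planes, width=64, height=64, focal=64.0):
    """Ray-cast one convex primitive from a pinhole camera at the origin looking along +z.

    Returns an (height, width) z-depth map, with inf where the ray misses.
    """
    normals = np.stack([n for n, _ in planes])          # (K, 3)
    offsets = np.array([d for _, d in planes])          # (K,)
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    # Ray directions have z = 1, so the ray parameter t equals z-depth.
    dirs = np.stack([(u - width / 2) / focal,
                     (v - height / 2) / focal,
                     np.ones_like(u)], axis=-1)         # (H, W, 3)
    denom = dirs @ normals.T                            # (H, W, K)
    with np.errstate(divide="ignore", invalid="ignore"):
        t = offsets / denom
    # Planes facing the ray give entry bounds, planes facing away give exit bounds.
    t_enter = np.where(denom < 0, t, -np.inf).max(axis=-1)
    t_exit = np.where(denom > 0, t, np.inf).min(axis=-1)
    # A ray parallel to a plane and on its outer side can never be inside.
    parallel_outside = ((denom == 0) & (offsets < 0)).any(axis=-1)
    t_enter = np.maximum(t_enter, 0.0)
    hit = (t_enter <= t_exit) & ~parallel_outside
    return np.where(hit, t_enter, np.inf)

# Edit the scene: push the block one unit further from the camera.
primitive = box(center=(0.0, 0.0, 4.0), half_size=(1.0, 1.0, 1.0))
depth_before = render_depth(primitive)
depth_after = render_depth(translate(primitive, (0.0, 0.0, 1.0)))
```

In this sketch the block's front face moves from depth 3 to depth 4 and covers fewer pixels after the edit. In a pipeline like the one the abstract describes, a depth map of this kind (rendered from all primitives, not just one) would be one of the two conditioning signals passed to the image generator.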
Peer Reviews
Decision: ICLR 2026 Poster
1. The method is training-free: by leveraging existing models, it produces very good image editing results.
2. It is very fast, better than the other 3D-aware editing methods I know of.
3. The texture hint injection works well: it does not require training the diffusion model, only injection during inference, and it still gives very good results. It even preserves text textures, which is amazing.
4. Overall, I love this paper very much; it leverages fundamental graphics technique
Thanks a lot to the authors for taking the time to discuss the failure cases; I feel these discussions are very valuable. All the weaknesses are acceptable, and I feel some can be solved by more advanced models within this framework. For example, the first row in Figure 8 is very likely a failure of Flux Depth, not of the framework itself.
The paper includes many qualitative demos, and the model preserves scene contents well.
There are very few quantitative comparisons with other works, and the quantitative results feel lacking overall. Other works tackle a similar style of task by prompting from optical flow or correspondences (motion prompting and go-with-the-flow are the ones I know of, and you also reference drag-diffusion). I might be wrong, but it feels like more quantitative comparisons could be done against these kinds of models. This paper is an interesting way of approaching the pro
1. Novelty and Motivation: The approach of revitalizing classic "blocks world" concepts for controlling modern generative models is highly innovative. It provides a clear and compelling solution for 3D-aware image manipulation, which is a significant problem in the field.
2. Decoupling of Geometry and Texture: The framework effectively decouples geometric control (via primitives and depth maps) from appearance generation (via the generative model and texture hints). This modularity is a key str
1. Limited Quantitative Evaluation: The quantitative comparison is confined to a single baseline (LooseControl) on a small, unstated set of test images. The paper's claims of superiority would be significantly strengthened by a more extensive evaluation on standard image editing benchmarks and against a wider array of recent methods, especially those with different interaction paradigms (e.g., drag-based).
2. Lack of Critical Ablation Studies: The paper is missing important ablation studies tha
Taxonomy
Topics: Art, Technology, and Culture · Digital Games and Media · Digital Humanities and Scholarship
Methods: Hierarchical Information Threading
