Generative Blocks World: Moving Things Around in Pictures

Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, D.A. Forsyth, and Anand Bhattad

arXiv:2506.20703 · cs.GR · March 23, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Generative Blocks World, a method for editing generated images by manipulating 3D primitives, enabling accurate, consistent, and flexible scene modifications with improved visual quality.

Contribution

It presents a novel scene representation built from convex 3D primitives, together with a flow-based image generation method conditioned on depth and texture hints, which improves editability and visual fidelity.
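As a rough illustration of how an edited convex primitive could yield the depth signal that conditions generation, the sketch below assumes a primitive is an intersection of half-spaces and renders it with a pinhole camera at the origin looking along +z. The function names (`ray_convex_depth`, `box_planes`, `render_depth`), the camera model, and the box example are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math

def ray_convex_depth(origin, direction, planes):
    """Distance t at which the ray enters a convex primitive, or None on a miss.

    The primitive is assumed to be the set of points x with dot(n, x) <= d
    for every (n, d) in `planes` (an intersection of half-spaces).
    """
    t_enter, t_exit = 0.0, math.inf
    for n, d in planes:
        denom = sum(a * b for a, b in zip(n, direction))
        dist = d - sum(a * b for a, b in zip(n, origin))
        if abs(denom) < 1e-12:
            if dist < 0:          # parallel to the plane and outside it
                return None
            continue
        t = dist / denom
        if denom < 0:             # ray crosses into this half-space
            t_enter = max(t_enter, t)
        else:                     # ray crosses out of this half-space
            t_exit = min(t_exit, t)
        if t_enter > t_exit:
            return None
    return t_enter

def box_planes(center, half):
    """Half-spaces of an axis-aligned box (one simple convex primitive)."""
    planes = []
    for axis in range(3):
        n = [0.0, 0.0, 0.0]
        n[axis] = 1.0
        planes.append((tuple(n), center[axis] + half[axis]))
        n[axis] = -1.0
        planes.append((tuple(n), -(center[axis] - half[axis])))
    return planes

def render_depth(planes, width=8, height=8, focal=8.0):
    """Z-depth map of one primitive; None marks pixels whose ray misses it."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    depth = []
    for v in range(height):
        row = []
        for u in range(width):
            # With direction z = 1, the ray parameter t equals z-depth.
            direction = ((u - cx) / focal, (v - cy) / focal, 1.0)
            row.append(ray_convex_depth((0.0, 0.0, 0.0), direction, planes))
        depth.append(row)
    return depth

# Edit the scene by moving the primitive two units away from the camera.
depth_before = render_depth(box_planes((0.0, 0.0, 4.0), (1.0, 1.0, 1.0)))
depth_after = render_depth(box_planes((0.0, 0.0, 6.0), (1.0, 1.0, 1.0)))
```

In this toy setup the edit is just a change to the primitive's parameters, and the depth map is re-rendered from the edited geometry rather than edited pixel by pixel.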

Findings

1. Outperforms prior methods in visual fidelity
2. Enables accurate object and camera movements
3. Preserves object identity during edits

Abstract

We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method, which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding the texture-consistency provided by existing techniques. These texture hints (a) allow accurate object and camera moves and (b) preserve the identity of objects. Our experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.
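The abstract does not spell out how a texture hint is computed. One plausible reading, shown here purely as an assumption, is a depth-based forward warp: each source pixel on a moved primitive is unprojected to 3D, moved with the primitive, and projected back, with a z-buffer resolving collisions. The function `warp_texture_hint`, the pinhole camera, and the translation-only edit are hypothetical and not taken from the paper.

```python
def warp_texture_hint(colors, depth, translation, focal=8.0):
    """Forward-warp source pixels to where their 3D points land after a move.

    colors, depth: row-major 2D lists of equal size. A depth of None marks a
    pixel that is not on the moved primitive. Returns a hint image holding
    None wherever no source pixel lands (left for the generator to fill).
    """
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    hint = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            if z is None:
                continue
            # Unproject the pixel to a 3D point, then apply the edit.
            x = (u - cx) / focal * z + translation[0]
            y = (v - cy) / focal * z + translation[1]
            z_new = z + translation[2]
            if z_new <= 0:
                continue  # moved behind the camera
            # Project back and keep only the nearest point per target pixel.
            u2 = round(x / z_new * focal + cx)
            v2 = round(y / z_new * focal + cy)
            if 0 <= u2 < width and 0 <= v2 < height and z_new < zbuf[v2][u2]:
                zbuf[v2][u2] = z_new
                hint[v2][u2] = colors[v][u]
    return hint

# A 4x4 toy image with a single textured pixel at depth 2, moved right by 0.5.
colors = [[None] * 4 for _ in range(4)]
depth = [[None] * 4 for _ in range(4)]
colors[1][1], depth[1][1] = "red", 2.0
hint = warp_texture_hint(colors, depth, (0.5, 0.0, 0.0), focal=4.0)
```

Under these assumptions the warped pixel carries its original color to the new location, which is the sense in which a hint of this kind could help preserve object identity; the uncovered pixels stay empty.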

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The method is training-free: by leveraging existing models, it provides very good image editing results.
2. It is very fast, faster than the other 3D-aware editing methods I know.
3. The texture hint injection works well. It does not require training the diffusion model, only injection during inference, yet it still gives very good results. It even keeps the text texture, which is amazing.
4. Overall, I love this paper very much; it leverages fundamental graphics technique

Weaknesses

Thanks a lot to the authors for taking the time to discuss the failure cases; I feel these discussions are very valuable. All the weaknesses are acceptable, and I feel some can be solved by more advanced models within this framework. For example, the first row in Figure 8 is very likely a failure of Flux Depth, not of the framework itself.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper includes many qualitative demos, and the model preserves scene contents well.

Weaknesses

The quantitative comparison with other works is very minimal, and the quantitative results feel lacking overall. Other works perform a similar style of task by prompting from optical flow or correspondences (Motion Prompting and Go-with-the-Flow are the ones I know, and you also reference DragDiffusion). I might be wrong, but it feels like more quantitative comparisons could be done against these kinds of models. This paper is an interesting way of approaching the pro

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Novelty and Motivation: The approach of revitalizing classic "blocks world" concepts for controlling modern generative models is highly innovative. It provides a clear and compelling solution for 3D-aware image manipulation, which is a significant problem in the field.
2. Decoupling of Geometry and Texture: The framework effectively decouples geometric control (via primitives and depth maps) from appearance generation (via the generative model and texture hints). This modularity is a key str

Weaknesses

1. Limited Quantitative Evaluation: The quantitative comparison is confined to a single baseline (LooseControl) on a small, unstated set of test images. The paper's claims of superiority would be significantly strengthened by a more extensive evaluation on standard image editing benchmarks and against a wider array of recent methods, especially those with different interaction paradigms (e.g., drag-based).
2. Lack of Critical Ablation Studies: The paper is missing important ablation studies tha

Taxonomy

Topics: Art, Technology, and Culture · Digital Games and Media · Digital Humanities and Scholarship

Methods: Hierarchical Information Threading