BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

Jiacheng Chen; Ramin Mehran; Xuhui Jia; Saining Xie; Sanghyun Woo

arXiv:2506.17450 · cs.CV · September 3, 2025

TL;DR

BlenderFusion is a framework for 3D-grounded visual editing and scene compositing. It segments visual inputs into editable 3D entities, edits them in Blender, and fuses the result with a diffusion-based generative compositor, giving control over objects, camera, and background.

Contribution

It introduces a layering-editing-compositing pipeline whose generative compositor is a pre-trained diffusion model fine-tuned on video frames to fuse the edited 3D entities into a coherent scene.

Findings

01

Outperforms prior methods in complex scene editing tasks

02

Enables flexible background replacement and object manipulation

03

Provides disentangled control over objects and camera movements

Abstract

We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene…
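The two training strategies named in the abstract can be illustrated with a small, hedged sketch. This is not the authors' implementation: `ObjectLayer`, `TrainingPair`, `mask_source`, and `jitter_objects` are hypothetical names, masking is simplified to dropping the whole source conditioning, and jittering is simplified to a random translation of each object while the camera is left untouched. The abstract does not specify the actual masking granularity or jitter distribution.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectLayer:
    """Hypothetical stand-in for one editable 3D entity from the layering step."""
    name: str
    position: Vec3


@dataclass(frozen=True)
class TrainingPair:
    """Hypothetical source/target pair; source_objects=None means 'masked'."""
    source_objects: Optional[List[ObjectLayer]]
    target_objects: List[ObjectLayer]


def mask_source(pair: TrainingPair, mask_prob: float, rng: random.Random) -> TrainingPair:
    """Assumed form of source masking: with probability mask_prob, hide the
    source conditioning so the model must rely on the target stream alone."""
    if rng.random() < mask_prob:
        return replace(pair, source_objects=None)
    return pair


def jitter_objects(objects: List[ObjectLayer], max_shift: float,
                   rng: random.Random) -> List[ObjectLayer]:
    """Assumed form of object jittering: move each object by a random offset
    in [-max_shift, max_shift] per axis, independent of any camera change."""
    jittered = []
    for obj in objects:
        new_pos = tuple(c + rng.uniform(-max_shift, max_shift) for c in obj.position)
        jittered.append(replace(obj, position=new_pos))
    return jittered
```

In this sketch, a training loop would call `jitter_objects` on the target-side objects before rendering and `mask_source` on the resulting pair before feeding the compositor. Separating the two functions mirrors the abstract's claim that masking supports edits such as background replacement, while jittering helps separate object motion from camera motion.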

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Clear, production-like workflow; easy to implement. Results suggest better local control on some inserts.

Weaknesses

**Key Baselines Omitted:**
- ZeroComp [1]: composites intrinsic layers (depth/normal/albedo/shading) and lets diffusion render the final image. Similar goal but without using Blender directly. But they use a rendering engine to give approx 3D compositing.
- DiffusionRenderer [2]: turns G-buffers into photoreal images/videos; direct alternative to “Blender render to diffusion fix.”
- 2D diffusion compositors: ObjectStitch [3], Thinking Outside the BBox [4], ControlCom [5], IMPRINT [6]: Generativ

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Clear Motivation and Strong Problem Formulation: The paper clearly identifies a significant and practical limitation in current generative AI: the lack of precise, 3D-aware control for complex, multi-object scene compositing. It effectively positions its contribution against existing methods (Table 1), clearly highlighting the gap it aims to fill.
2. Novel and Elegant Framework Design: The primary strength of this work lies in its core idea of decoupling 3D control from generative synthesis.

Weaknesses

1. Insufficient Detail on the Core Technical Novelty (Sec. 3.2): The paper's primary methodological contribution, the "Dual-stream Diffusion Compositor" in Section 3.2, is not described with sufficient clarity. The architecture is presented as a high-level black box, and the paper fails to provide a detailed diagram or explanation of the crucial "cross-stream interaction" mechanism. It is strongly recommended that the authors add a dedicated figure and more detailed text to fully articulate this

Reviewer 03 · Rating 8 · Confidence 4

Strengths

• The paper is well-written with a logical structure that makes the technical contributions easy to follow.
• The proposed framework is reasonable and well-justified. The experimental results convincingly demonstrate the effectiveness of the approach across various compositing scenarios.
• Excellent supplementary materials: The demo videos and project page significantly aid in understanding the core concepts and practical applications of the method.

Weaknesses

• Recent works have explored 3D scene reconstruction and composition capabilities. A more thorough comparison and discussion of the relationship between BlenderFusion and these methods would strengthen the paper. For example:
  • CAST [1] performs component-aligned 3D scene reconstruction from a single RGB image. How does BlenderFusion's layering approach compare to CAST's decomposition strategy?
  • What are the trade-offs between the generative compositing approach and traditional 3D reconstructio

Taxonomy

Methods: Diffusion · Softmax · RoIAlign