Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

Canyu Zhao; Xiaoman Li; Tianjian Feng; Zhiyue Zhao; Hao Chen; Chunhua Shen

arXiv:2508.14811 · cs.CV · August 21, 2025

Open Access · 3 Reviews

TL;DR

Tinker is a novel 3D editing framework that achieves multi-view consistent edits from sparse inputs without per-scene optimization by leveraging pretrained diffusion models and a new multi-view editing dataset.

Contribution

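It introduces a 3D editing method that builds on pretrained diffusion models and needs no per-scene finetuning, working from as few as one or two input images. Its new components are a referring multi-view editor for reference-driven editing and a scene completion model for sparse inputs.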

Findings

01

Achieves state-of-the-art results in 3D editing and view synthesis.

02

Operates without per-scene finetuning, enabling scalable 3D content creation.

03

Reduces barriers to generalizable 3D editing from minimal input data.

Abstract

We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven…

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed method is efficient, reasonable, and offers high quality.
2. The writing is good.
3. The proposed dataset will be very helpful to this field.

Weaknesses

1. The main issue is that this method looks very complicated, though it is necessary to make the 3D editing feed-forward. Still, too many components are involved in this process. Overall, I think this is a good paper and worth acceptance. I just encourage the authors to think of the next step and tackle this task in a more elegant way.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper introduces a one-shot or few-shot approach for 3D editing.
2. This paper proposes a generalizable pipeline for 3D editing.
3. The paper is well-structured and easy to follow.

Weaknesses

1. For the multi-view image editing model, when dealing with views that have large variations, does the multi-view consistency of the edits decrease? If so, could these inconsistencies be further propagated and amplified by the subsequent scene completion model?
2. Since the scene completion model relies on geometric information like depth maps, in few-shot or even one-shot settings with large view variations, is it prone to introducing more hallucinations or geometric distortions to fill in the

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The visual results are impressive and demonstrate the effectiveness of the proposed method.
2. The use of a video model for 3D editing is an innovative approach. With video generative priors, TINKER is the first method capable of jointly editing both 3D and 4D scenes.

Weaknesses

1. There is an over-claim of contributions. Many baselines, such as DGE and GaussCtrl, also do not require fine-tuning the diffusion model. Therefore, this should not be considered a unique contribution of TINKER.
2. The majority of the baselines use InstructNerf2Nerf or ControlNet as the base 2D editors, whereas TINKER utilizes the FLUX model. It is unclear where the true improvement lies: is it in the advanced 2D editing model, or is it in the proposed pipeline? What if these baselines were eq

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Additive Manufacturing and 3D Printing Technologies