Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen

TL;DR
Tinker is a novel 3D editing framework that achieves multi-view consistent edits from sparse inputs without per-scene optimization by leveraging pretrained diffusion models and a new multi-view editing dataset.
Contribution
It introduces a 3D editing method built on pretrained diffusion models that needs no per-scene finetuning and works from as few as one or two input images, with new components for reference-driven multi-view editing and scene completion from sparse inputs.
Findings
Achieves state-of-the-art results in 3D editing and view synthesis.
Operates without per-scene finetuning, enabling scalable 3D content creation.
Reduces barriers to generalizable 3D editing from minimal input data.
Abstract
We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed method is efficient, reasonable, and offers high quality. 2. The writing is good. 3. The proposed dataset will be very helpful to this field.
1. The main issue is that this method looks very complicated, though it is necessary to make the 3D editing feed-forward. Still, too many components are involved in this process. Overall, I think this is a good paper and worth acceptance. I just encourage the authors to think of the next step and tackle this task in a more elegant way.
1. This paper introduces a one-shot or few-shot approach for 3D editing. 2. This paper proposes a generalizable pipeline for 3D editing. 3. The paper is well-structured and easy to follow.
1. For the multi-view image editing model, when dealing with views that have large variations, does the multi-view consistency of the edits decrease? If so, could these inconsistencies be further propagated and amplified by the subsequent scene completion model? 2. Since the scene completion model relies on geometric information like depth maps, in few-shot or even one-shot settings with large view variations, is it prone to introducing more hallucinations or geometric distortions to fill in the
1. The visual results are impressive and demonstrate the effectiveness of the proposed method. 2. The use of a video model for 3D editing is an innovative approach. With video generative priors, TINKER is the first method capable of jointly editing both 3D and 4D scenes.
1. There is an over-claim of contributions. Many baselines, such as DGE and GaussCtrl, also do not require fine-tuning the diffusion model. Therefore, this should not be considered a unique contribution of TINKER. 2. The majority of the baselines use Instruct-NeRF2NeRF or ControlNet as the base 2D editors, whereas TINKER utilizes the FLUX model. It is unclear where the true improvement lies: is it in the advanced 2D editing model, or is it in the proposed pipeline? What if these baselines were eq
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Additive Manufacturing and 3D Printing Technologies
