Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang

TL;DR
Edit3r is a fast, feed-forward framework that enables instant 3D scene editing from sparse, unposed images by predicting instruction-aligned edits without requiring scene-specific optimization.
Contribution
It introduces a training strategy that combines SAM2-based recoloring with asymmetric input pairing, enabling 3D editing from unposed images without requiring multi-view-consistent edited images for supervision.
Findings
Achieves superior semantic alignment and 3D consistency.
Runs significantly faster at inference than prior methods that require per-scene optimization.
Handles images edited by 2D methods such as InstructPix2Pix at inference.
Abstract
We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during…
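The two training ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the masks are assumed to come from a segmenter such as SAM2 but are hand-written here, the tint-blend recoloring is a stand-in for whatever recoloring the paper uses, and the names `recolor` and `build_training_sample` are hypothetical. The sketch applies the same recolor to the masked object in every view to obtain cross-view-consistent targets, then feeds the network one recolored reference view alongside raw auxiliary views.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]


def recolor(image: Image, mask: Mask, tint: Pixel, strength: float = 0.5) -> Image:
    """Blend `tint` into masked pixels; unmasked pixels are left unchanged.

    Assumption: a simple linear blend stands in for the paper's recoloring.
    """
    out: Image = []
    for row, mask_row in zip(image, mask):
        new_row = []
        for px, inside in zip(row, mask_row):
            if inside:
                px = tuple(
                    round((1 - strength) * c + strength * t)
                    for c, t in zip(px, tint)
                )
            new_row.append(px)
        out.append(new_row)
    return out


def build_training_sample(
    views: List[Image], masks: List[Mask], tint: Pixel, ref_index: int = 0
) -> Tuple[List[Image], List[Image]]:
    """Return (inputs, targets) for one hypothetical training sample.

    targets: every view recolored with the same tint (consistent supervision).
    inputs:  only the reference view is recolored; auxiliary views stay raw.
    """
    targets = [recolor(v, m, tint) for v, m in zip(views, masks)]
    inputs = [
        targets[i] if i == ref_index else views[i] for i in range(len(views))
    ]
    return inputs, targets


# Two tiny 1x2 "views" of the same object, seen at different pixel positions.
views = [[[(100, 100, 100), (0, 0, 0)]], [[(0, 0, 0), (100, 100, 100)]]]
masks = [[[True, False]], [[False, True]]]
inputs, targets = build_training_sample(views, masks, tint=(200, 0, 0))
```

In this toy setup the network would see an edited reference view and unedited auxiliary views, while the loss would compare its renderings against the uniformly recolored targets, which is one way to read the abstract's "fuse and align disparate observations".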
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Advanced Vision and Imaging
