VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng

TL;DR
VoxHammer introduces a training-free method for precise and coherent 3D editing directly in the native 3D space by leveraging inversion trajectories and latent space manipulation, outperforming existing techniques in consistency and quality.
Contribution
The paper presents VoxHammer, a training-free approach that edits 3D models in 3D latent space. It keeps unedited regions consistent by reusing the inverted latents and cached key-value tokens recorded along the model's inversion trajectory.
Findings
VoxHammer outperforms existing methods in 3D consistency.
The authors construct Edit3D-Bench for evaluating 3D editing quality.
VoxHammer achieves high-quality, coherent 3D edits without training.
Abstract
3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To…
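The latent-replacement idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `velocity` function is a hypothetical stand-in for a learned denoiser, the Euler integrator, token count, and conditions are all assumptions, and the key-value token caching mentioned in the abstract is not modeled here. The sketch only shows that overwriting preserved tokens with cached inversion latents at every denoising step reproduces those tokens exactly, while the remaining tokens follow the new condition.

```python
import numpy as np


def velocity(x, cond):
    # Hypothetical stand-in for a learned denoiser: drifts each token toward `cond`.
    return cond - x


def invert(x0, cond, num_steps):
    """Walk from the clean latent toward noise, caching the latent at every level.

    traj[0] is the clean latent; traj[num_steps] is the fully inverted one.
    """
    dt = 1.0 / num_steps
    traj = [x0.copy()]
    x = x0.copy()
    for _ in range(num_steps):
        x = x - dt * velocity(x, cond)  # Euler step in the noising direction
        traj.append(x.copy())
    return traj


def edit(traj, keep, edit_cond):
    """Denoise under a new condition while pinning preserved tokens to the cache."""
    num_steps = len(traj) - 1
    dt = 1.0 / num_steps
    x = traj[-1].copy()
    for k in range(num_steps, 0, -1):
        x = x + dt * velocity(x, edit_cond)  # Euler step in the denoising direction
        x[keep] = traj[k - 1][keep]          # overwrite preserved tokens with cached latents
    return x


rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 4))                 # 8 tokens (assumed), 4 latent channels
keep = np.array([True] * 5 + [False] * 3)    # first 5 tokens form the preserved region
traj = invert(x0, cond=np.zeros(4), num_steps=10)
edited = edit(traj, keep, edit_cond=np.ones(4))
```

In this toy setup the preserved rows of `edited` are identical to `x0` by construction, since the last overwrite copies from `traj[0]`. In a real attention-based model, tokens interact, which is presumably why the abstract also replaces cached key-value tokens; that part is left out of this sketch.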
