Variation-aware Flexible 3D Gaussian Editing
Hao Qin, Yukai Sun, Meng Wang, Ming Kong, Mengxu Lu, Qiang Zhu

TL;DR
VF-Editor enables direct, flexible, and efficient 3D Gaussian primitive editing by predicting attribute variations, overcoming cross-view inconsistencies of previous indirect methods, and effectively transferring 2D editing knowledge to 3D.
Contribution
The paper introduces VF-Editor, a novel method for native 3D Gaussian editing that distills 2D editing knowledge into a unified predictor for improved flexibility and accuracy.
Findings
VF-Editor outperforms indirect editing methods in consistency and flexibility.
The approach effectively transfers diverse 2D editing strategies to 3D.
Experiments demonstrate significant improvements on multiple datasets.
Abstract
Indirect editing methods for 3D Gaussian Splatting (3DGS) have recently witnessed significant advancements. These approaches operate by first applying edits in the rendered 2D space and subsequently projecting the modifications back into 3D. However, this paradigm inevitably introduces cross-view inconsistencies and constrains both the flexibility and efficiency of the editing process. To address these challenges, we present VF-Editor, which enables native editing of Gaussian primitives by predicting attribute variations in a feedforward manner. To accurately and efficiently estimate these variations, we design a novel variation predictor distilled from 2D editing knowledge. The predictor encodes the input to generate a variation field and employs two learnable, parallel decoding functions to iteratively infer attribute changes for each 3D Gaussian. Thanks to its unified design,…
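The core idea in the abstract, predicting per-Gaussian attribute variations and composing them with the source primitives, can be illustrated with a minimal sketch. This is not the authors' implementation: the additive composition rule, the attribute set, the clamping, and the `strength` knob are all assumptions made for illustration, and the variation predictor itself is left out (the variations are simply given as input).

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Simplified 3D Gaussian primitive (hypothetical attribute set)."""
    position: Vec3
    scale: Vec3
    color: Vec3      # RGB in [0, 1]; real 3DGS usually stores SH coefficients
    opacity: float   # in [0, 1]


@dataclass(frozen=True)
class Variation:
    """Predicted per-primitive change; defaults mean 'leave unchanged'."""
    d_position: Vec3 = (0.0, 0.0, 0.0)
    d_scale: Vec3 = (0.0, 0.0, 0.0)
    d_color: Vec3 = (0.0, 0.0, 0.0)
    d_opacity: float = 0.0


def _add(a: Vec3, b: Vec3, strength: float) -> Vec3:
    return tuple(x + strength * y for x, y in zip(a, b))


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def apply_variation(g: Gaussian, v: Variation, strength: float = 1.0) -> Gaussian:
    """Compose a variation with its source Gaussian.

    Assumption: composition is additive and scaled by `strength`, a
    hypothetical control that is not taken from the paper.
    """
    return Gaussian(
        position=_add(g.position, v.d_position, strength),
        # Keep scales strictly positive so the primitive stays valid.
        scale=tuple(max(1e-6, s) for s in _add(g.scale, v.d_scale, strength)),
        color=tuple(_clamp01(c) for c in _add(g.color, v.d_color, strength)),
        opacity=_clamp01(g.opacity + strength * v.d_opacity),
    )


def edit_scene(gaussians: List[Gaussian], variations: List[Variation],
               strength: float = 1.0) -> List[Gaussian]:
    """Apply one variation per Gaussian; the scene itself is never re-rendered."""
    if len(gaussians) != len(variations):
        raise ValueError("expected exactly one variation per Gaussian")
    return [apply_variation(g, v, strength) for g, v in zip(gaussians, variations)]
```

Because the edit acts on the primitives themselves rather than on rendered views, every viewpoint sees the same modified Gaussians, which is the consistency argument the abstract makes against 2D-edit-then-project pipelines.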
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper responds to a well-recognized limitation in 3DGS editing: the cross-view inconsistencies inherent to indirect, 2D-edit-then-project pipelines. The authors' framing is accurate and well-motivated, making a convincing case for the need for direct, native 3D editing.
2. The feedforward variation predictor architecture is novel. The random tokenizer, transformer-based variation field generator, and parallel iterative decoding functions offer a clear path to efficient, scalable editing.
1. Although the Related Work section is relatively comprehensive for 3DGS and 2D distillation methods, several highly pertinent and recent methods are missing. In particular:
- 3DSceneEditor (Yan et al., 2024) is another fully 3D-based native editing pipeline leveraging Gaussian Splatting. This work should be discussed in Section 2 and included as a baseline in Section 4.2/Table 2.
- Gaussian Splatting in Style (Saroha et al., 2024), which introduces neural style transfe
Predicting changes (Δ) instead of the final result is a smart and natural fit for 3D Gaussian Splatting. Since 3DGS is made up of explicit, editable primitives, it makes more sense to modify their parameters directly than to infer 3D edits indirectly from 2D images. The feed-forward nature provides a significant speed-up (0.3s) over iterative optimization methods.
Data Dependency: The entire framework is built on offline triplet collection ($\mathcal{L}_{din}$). Table 1 indicates that 28,932 triplets were required for only 20 instructions. This approach seems to scale very poorly for a truly "open-vocabulary" editor. The paper admits in Sec. 4.6 that it does not support "out-of-domain editing" without fine-tuning (Fig. 14). This suggests the model is learning a mapping for a fixed set of instructions, not a general-purpose, compositional understanding of
The main strength of the paper is a neat problem reformulation: instead of predicting edited Gaussians outright, the proposed pipeline predicts per-primitive variations and composes them with the source. This gives a controllable, native 3D editing interface and sidesteps multi-view back-projection issues. Such a framing, together with the random tokenizer and the iterative, parallel decoders for position versus other attributes, feels fresh within 3DGS editing and is well-motivated by the repre
I think there are a couple of areas for improvement that are worth discussing. These cluster around data coverage, evaluation, and methodology.
- The training data is well assembled but is perhaps still small and skewed toward objects, with only a handful of scenes; admittedly, the authors note the lack of OOD support (e.g. new categories or environments), which constrains claims of universality and open-vocabulary editing. A more convincing path may add diverse indoor/outdoor scenes, articulated human
Taxonomy
Topics: 3D Shape Modeling and Analysis · 3D Printing in Biomedical Research · Interactive and Immersive Displays
