Native 3D Editing with Full Attention
Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen

TL;DR
This paper introduces a fast, native 3D editing framework that manipulates 3D representations directly in a single feed-forward pass. It avoids the slow optimization and the cross-view inconsistency of earlier methods, and it reports better quality and consistency than those baselines.
Contribution
It presents a novel native 3D editing approach with a large-scale dataset and two conditioning strategies, notably a new 3D token concatenation method that improves efficiency and performance.
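To make the two conditioning strategies concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that "3D token concatenation" means appending the source object's tokens to the target tokens along the sequence axis and running full self-attention over the joint sequence, and that the alternative is a cross-attention layer with its own projection weights. The function names, token counts, and the identity projections in the concatenation path are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over 2D arrays of shape (tokens, dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_attention_conditioning(target, source, w_q, w_k, w_v):
    # Target tokens query the source tokens through an extra layer.
    # w_q, w_k, w_v are new parameters introduced only for conditioning.
    return target + attention(target @ w_q, source @ w_k, source @ w_v)

def token_concat_conditioning(target, source):
    # Assumption: source tokens are concatenated along the sequence axis and
    # the existing self-attention attends over the joint sequence.
    # Identity projections are used here purely to keep the sketch short.
    tokens = np.concatenate([target, source], axis=0)
    out = attention(tokens, tokens, tokens)
    return out[: target.shape[0]]  # keep only the target positions

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 8))    # hypothetical: 6 target tokens, dim 8
source = rng.normal(size=(10, 8))   # hypothetical: 10 source tokens, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out_cross = cross_attention_conditioning(target, source, w_q, w_k, w_v)
out_concat = token_concat_conditioning(target, source)
```

In this sketch the cross-attention path adds three 8x8 weight matrices, while the concatenation path adds none and instead pays for a longer attention sequence (16 tokens rather than 6). That trade-off is one plausible reading of the "more parameter-efficient" finding.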
Findings
Outperforms existing 2D-lifting methods in quality and consistency
Token concatenation strategy is more parameter-efficient and effective
Sets a new benchmark in instruction-guided 3D editing
Abstract
Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is clearly written and easy to follow.
2. The experimental comparison is thorough, benchmarking against numerous established baselines.
1. Limited visual quality. The method only supports very simple edits, essentially adding or replacing nearly solid-colored objects. For example, in Figure 3 (row 3), the added signboard is completely white with no texture or pattern. In contrast, HunYuan3D-2.1 produces significantly superior results. Most generated outputs also appear somewhat coarse and fall far short of the fidelity achieved by optimization-based methods (e.g., NANO3D).
2. Lack of novelty. Both the cross-attention strategy an
1) Large-scale dataset construction. The paper creates over 110,000 training samples covering deletion, addition, and modification tasks through a systematic pipeline.
2) Improvements on automatic metrics. Shows substantial gains over baselines (FID: 126.2 to 91.9) on standard benchmarks.
1) The paper claims native 3D editing avoids 2D consistency problems, but evaluates using only 2D image metrics (FID/FVD/CLIP). These metrics cannot measure 3D geometric consistency, mesh quality, or whether edits are correctly localized in 3D space.
2) No 3D geometric metrics. Missing essential measurements like Chamfer Distance on unedited regions, mesh quality checks (self-intersections, non-manifold edges), or 3D spatial accuracy of edits. Cannot verify the claimed advantage of native 3D e
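The check this review asks for, Chamfer Distance restricted to unedited regions, can be sketched as follows. This is an illustration under stated assumptions, not a metric from the paper: both objects are assumed to be sampled as point clouds, the edited region is assumed to be given as an axis-aligned bounding box, and the symmetric squared-distance form of Chamfer Distance is used. The function names and the toy points are hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    # Symmetric Chamfer Distance with squared Euclidean distances.
    # a: (N, 3) and b: (M, 3) point clouds, both assumed non-empty.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def unedited_chamfer(source_pts, edited_pts, box_min, box_max):
    # Drop points inside the (assumed) edit bounding box, then compare
    # what remains. A value near zero means the untouched geometry survived.
    def outside(points):
        inside = np.all((points >= box_min) & (points <= box_max), axis=1)
        return points[~inside]
    return chamfer_distance(outside(source_pts), outside(edited_pts))

# Toy example: the "edit" adds one point inside the box and changes nothing else.
source_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edited_pts = np.vstack([source_pts, [[5.0, 5.0, 5.0]]])
box_min = np.array([4.0, 4.0, 4.0])
box_max = np.array([6.0, 6.0, 6.0])

full_cd = chamfer_distance(source_pts, edited_pts)
unedited_cd = unedited_chamfer(source_pts, edited_pts, box_min, box_max)
```

In the toy case the full-object distance is nonzero because of the added point, while the unedited-region distance is exactly zero. That separation between intended change and collateral change is what 2D image metrics cannot report.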
- **Dataset Contribution**: The primary strength of this paper is the introduction of the large-scale dataset specifically for instruction-guided native 3D editing.
- **Extensive evaluation results**: The qualitative results shown in Figures 3 and 4 illustrate the method's superiority in maintaining consistency and following instructions.
- **Limited Methodological Novelty**: While the overall framework is effective, its methodological novelty is somewhat limited. The backbone architecture is an existing pre-trained model (TRELLIS). The core "novel" contribution, the 3D token concatenation strategy, is a well-established technique for conditioning in other generative domains (e.g., 2D inpainting and image editing). While the authors claim to be the first to apply this to the 3D domain, this is more of a successful adaptation than
- The high-level ideas of the method are very easy to understand.
- Fig.2 is visually nice.
- A 3D editing dataset is proposed for training.
- **Crucial.** There is no video or multi-view presentation of the results. **My reviewing score will be 0 without these representations.**
- Each result is only represented in a single-view image, which does not at all show clues about 3D consistency or global appearance.
- 3D editing is a task highly dependent on qualitative results. All the quantitative metrics evaluate aspects that are only tangentially related, without considering the crucial aspects of 3D consistency and visual
Taxonomy
Topics: 3D Shape Modeling and Analysis · Interactive and Immersive Displays · 3D Printing in Biomedical Research
