InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image
Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, Jun Zhu

TL;DR
InstructPix2NeRF introduces an end-to-end diffusion-based framework for 3D portrait editing from a single image using natural language instructions, achieving multi-semantic edits while preserving identity and 3D consistency.
Contribution
The paper presents a novel diffusion-based approach with a token position randomization strategy and identity consistency module for instructed 3D portrait editing from a single image.
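The summary names a token position randomization strategy but does not describe how it works. As a minimal sketch of one plausible reading (an assumption, not the authors' implementation), the instruction's tokens could be placed at a random offset inside a fixed-length padded window during training, so the model does not tie an edit to a fixed token position. The function name, the window length of 77, and the padding id of 0 are all hypothetical choices made for illustration.

```python
import random
from typing import List, Optional


def randomize_token_position(
    token_ids: List[int],
    context_length: int = 77,  # assumed fixed text-encoder window (hypothetical)
    pad_id: int = 0,           # assumed padding token id (hypothetical)
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Place the instruction tokens at a random offset in a padded window.

    Illustrative only: the source does not specify the exact procedure.
    """
    if len(token_ids) > context_length:
        raise ValueError("instruction is longer than the context window")
    rng = rng or random.Random()
    # Choose any start position that keeps the whole instruction in the window.
    offset = rng.randint(0, context_length - len(token_ids))
    padded = [pad_id] * context_length
    padded[offset:offset + len(token_ids)] = token_ids
    return padded


# Usage: the same instruction lands at different positions across calls.
example = randomize_token_position([5, 6, 7], context_length=10,
                                   rng=random.Random(0))
```

Under this reading, token order within an instruction is preserved and only the starting position varies between training samples.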
Findings
Outperforms strong baselines quantitatively and qualitatively.
Enables multi-semantic editing in a single pass.
Maintains high 3D identity consistency.
Abstract
With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With…
Peer Reviews
Decision · ICLR 2024 poster
The strengths of the proposed paper can be summarized as: 1. The authors propose a token randomization strategy that can increase the model's capability for editing multiple attributes simultaneously. 2. An identity-preserving module is proposed to guide the editing process and present the original identity in the final outcomes. 3. The proposed method is reported to be time-friendly, producing results in a few seconds.
The weaknesses of the proposed method can be summarized as: 1. Through the visualization in Figure 1, I find that the original identity and RGB image attributes are not well preserved. Large differences can still be observed in the areas that are not supposed to be edited. 2. Qualitative comparisons. (1) The proposed method seems to struggle with expression editing, e.g., it fails to make the head smile; the InstructPix2Pix model doesn't encounter this problem; (2) Regarding the "bangs" examp
1) While each individual component of the method isn’t novel, the whole method itself is. 2) Qualitative results in both the paper and appendix demonstrate plausible editing, though some identity loss remains. 3) Quantitative results demonstrate that the method better preserves the identity across edits. The user study additionally bolsters the main contribution of the paper.
1) The methods section could be written better, with a clear exposition of losses during training and the forward pass during inference. To that end, Fig. 2 should be expanded to include both training and inference settings. 2) While identity consistency is better preserved than in prior work, some identity drift still remains during editing.
- It is the first (to my knowledge) paper that allows "instructed" 3D portrait editing from single images. - The experiments show that the proposed method outperforms compared baselines under the authors' settings.
- The results shown in the paper lack racial diversity. There are almost no Asian or Black people. I'm worried that the proposed method may not perform well in those cases. - The identity may change after applying the proposed method. For example, in the first example of Fig. 1, the eye shape changed after the beard was removed. In the middle example of Fig. 3, the girl seems to look more Asian and the nose shape changed after editing. These are not analyzed in the limitation section. - The proposed metho
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
