InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Daniel Gilo, Or Litany

TL;DR
InstructMix2Mix (I-Mix2Mix) distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, so that text-guided edits of sparse input views stay consistent across views and avoid the artifacts of prior methods.
Contribution
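The paper presents a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model. It replaces the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations for cross-view consistency.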
Findings
Significantly improves multi-view consistency in sparse-view editing.
Maintains high per-frame edit quality.
Outperforms existing methods in coherence and artifact reduction.
Abstract
We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a…
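For readers unfamiliar with Score Distillation Sampling, the following is a minimal toy sketch of a generic SDS update, not the authors' method. It assumes the "student" is simply a list of pixel values optimized directly (the paper instead uses a multi-view diffusion student), and the "teacher" is a hypothetical analytic denoiser, `toy_teacher_eps`, whose data distribution is a single target image. All function names, the noise schedule, and the weighting w(t) = sigma^2 are illustrative assumptions.

```python
import math
import random


def toy_teacher_eps(x_t, alpha, sigma, target):
    """Hypothetical frozen teacher: the exact noise prediction when the
    data distribution is a point mass at `target` (an assumption made
    only so the example is self-contained)."""
    return [(x - alpha * m) / sigma for x, m in zip(x_t, target)]


def sds_step(theta, target, rng, lr=0.5):
    """One generic SDS update on directly parameterized pixels."""
    # Sample a noise level; the cosine/sine schedule is an assumption.
    t = rng.uniform(0.2, 0.8)
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)

    # Noise the current student output.
    eps = [rng.gauss(0.0, 1.0) for _ in theta]
    x_t = [alpha * th + sigma * e for th, e in zip(theta, eps)]

    # Query the teacher and form the SDS gradient w(t) * (eps_hat - eps).
    # The student output equals theta here, so dx/dtheta is the identity.
    eps_hat = toy_teacher_eps(x_t, alpha, sigma, target)
    w = sigma ** 2
    return [th - lr * w * (eh - e) for th, eh, e in zip(theta, eps_hat, eps)]


def run_toy_sds(steps=200, seed=0):
    """Optimize 16 toy 'pixels' toward the teacher's target image."""
    rng = random.Random(seed)
    target = [0.7] * 16
    theta = [0.0] * 16
    for _ in range(steps):
        theta = sds_step(theta, target, rng)
    return theta, target
```

Calling `run_toy_sds()` drives the pixel values toward the teacher's target. In this toy the teacher's prediction minus the injected noise is proportional to the gap between student and target, so each step shrinks that gap; real teachers are learned networks and give no such closed form.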
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The idea of distilling a 2D monocular image editing model into multi-view diffusion models using SDS loss is interesting and novel.
- The experiments are extensive, and the ablation study is thorough.
- The paper is clearly written and well presented.
- For 3D editing models, providing a continuous multi-view video visualization is important, as showing only selected frames in the paper is insufficient.
- It would be better to include an efficiency comparison (similar to Table 1 in DGE). Also, during real 3D editing applications, is an extra step needed to convert the multi-view diffusion model into a 3D representation?
- This paper is generally easy to follow.
- The proposed idea of bridging existing models through a teacher–student distillation framework is conceptually appealing and has potential for broader applicability.
1. **Motivation for Sparse-View Editing is Weak**: The motivation for addressing sparse-view editing is not sufficiently convincing. The authors assume that users often possess only a limited number of input views to edit, but this assumption is not empirically supported. It remains unclear whether sparse multi-view editing scenarios are common in real-world applications.
2. **Weak Multi-View Consistency**: The model exhibits noticeable inconsistencies across views, e.g., the ear shape differenc
The paper correctly identifies the limitations arising from the need for dense image data in 3D editing tasks and proposes a method that leverages two diffusion models without relying on a 3D model. This approach reflects an interesting attempt to reduce dependency on explicit 3D supervision.
1. **Missing figure indices and captions throughout the paper.** Many figures are presented without indices or captions, particularly on pages 4, 5, and 8. In addition, the graph on page 9 also lacks both an index and a caption. These omissions significantly hinder the reader from following and understanding the paper.
2. **Naive use of SDS leads to degraded editing quality.** The paper applies SDS in a straightforward manner, which is known to produce low-quality and overly saturated results i
- The setting of sparse multi-view image editing for realistic scenes is underexplored, and as demonstrated in the paper, previous methods struggle in this scenario. The paper presents a method that successfully addresses this challenging setting.
- The proposed approach is novel and well-designed. As noted by the authors, it also has the potential to be applied to other related tasks. The idea of using an SDS loss to optimize a diffusion model is intriguing.
- The results look good, and the inc
My main concern is the evaluation of consistency. Although the qualitative results appear consistent, it is difficult to assess consistency based on only four frames. I believe that the CLIP consistency metric does not accurately capture the 3D consistency of the results. For example, I would expect the outputs produced by the student-only configuration to be much more 3D-consistent than those of the teacher-only configuration, since the student is explicitly trained to generate consistent views
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
