MV-RAG: Retrieval Augmented Multiview Diffusion
Yosef Dayani, Omer Benishu, Sagie Benaim

TL;DR
MV-RAG enhances text-to-3D generation by retrieving relevant 2D images and conditioning a multiview diffusion model, significantly improving out-of-domain concept handling, 3D consistency, and realism.
Contribution
The paper introduces MV-RAG, a retrieval-augmented pipeline with a novel training strategy for better out-of-domain concept synthesis in text-to-3D generation.
Findings
Improves 3D consistency and photorealism for OOD concepts.
Outperforms state-of-the-art methods on challenging OOD prompts.
Maintains competitive performance on standard benchmarks.
Abstract
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a…
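As a rough illustration of the retrieval step described above, the following sketch ranks database images by cosine similarity to a query embedding and keeps the top k. It is not the authors' implementation: the abstract does not say which encoder or similarity measure MV-RAG uses, so the embedding vectors, the function names `cosine_similarity` and `retrieve_top_k`, and the toy data are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: Sequence[float],
    database_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k database entries closest to the query.

    Hypothetical helper: assumes the prompt and the images were embedded into the
    same vector space by some encoder that is not specified in the source.
    """
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(database_embeddings)
    ]
    # Highest similarity first; the sort is stable, so ties keep database order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (made up): one 2-D query embedding and three database embeddings.
query = [1.0, 0.0]
database = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
top = retrieve_top_k(query, database, k=2)
```

In a pipeline of the kind the abstract describes, the images at the returned indices would then serve as conditioning inputs to the multiview diffusion model; that conditioning stage is not sketched here.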
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is very well written, organized, and easy to follow.
- The paper is well-motivated, tackling an important problem of OOD generation or "rare" concepts that were not sufficiently trained to diffusion models, therefore yielding suboptimal training results when applied to 3D generation, which is a practical and prevalent problem in multiview diffusion models.
- The paper proposes a coherent and intuitive methodology that fits well to their problem at hand: they start from the MVDream a
- The paper could benefit from additional explanation on how it conducts semantic/geometric augmentation at a 3D training setting, and additional details regarding how much augmentation the model can take. For example, if the semantically augmented retrieval images deviate too much from the source 3D asset, the model training may suffer degradation rather than learning 3D consistency: please elaborate on this aspect of the method.
- One question that I have with this method is that 2D image gene
Quality & Originality: The primary strength of this paper lies in its exceptionally comprehensive and multi-faceted experimental evaluation, which leaves little room for doubt regarding the efficacy of the proposed MV-RAG framework. The experimental section is thorough and systematic, and the experiments are designed to systematically validate the method's performance across a wide range of scenarios and against numerous strong baselines. (1) The paper includes detailed ablation experiments on
(1) Ablating the distinct roles of 2D mode and 3D mode (Section 4.3) The qualitative ablation in Figure 7 effectively illustrates the distinct roles of the 2D and 3D training modes. The accompanying text (lines 458-461) states that the 2D mode is crucial for separating objects from in-the-wild backgrounds, while the 3D mode is vital for correct shape rendering and background consistency. However, the described effects that are specifically concerning background separation and shape correctness c
The application of retrieval-augmented generation to 2D–3D diffusion is novel and well-motivated. The proposed hybrid 2D–3D training scheme is technically sound and effectively integrates structured and unstructured supervision. The paper is thorough and well-organized, featuring extensive ablation studies, user evaluations, and clear architectural details.
Although qualitative results are strong, the quantitative improvements on standard metrics (e.g., PSNR, IS) are modest—noticeable but not substantial. The evaluation could benefit from comparisons with more diverse and standardized baselines such as Wonder3D or SyncDreamer.
