MV-RAG: Retrieval Augmented Multiview Diffusion

Yosef Dayani; Omer Benishu; Sagie Benaim

arXiv:2508.16577 · cs.CV · August 25, 2025

TL;DR

MV-RAG enhances text-to-3D generation by retrieving relevant 2D images and conditioning a multiview diffusion model, significantly improving out-of-domain concept handling, 3D consistency, and realism.

Contribution

The paper introduces MV-RAG, a retrieval-augmented pipeline with a novel training strategy for better out-of-domain concept synthesis in text-to-3D generation.

Findings

1. Improves 3D consistency and photorealism for out-of-domain (OOD) concepts.
2. Outperforms state-of-the-art methods on challenging OOD prompts.
3. Maintains competitive performance on standard benchmarks.

Abstract

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a…
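The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which retriever, embedding model, or number of retrieved images MV-RAG uses, so everything below is an assumption: the prompt and the database images are taken to be already embedded in a shared vector space, ranking is by plain cosine similarity, and the names `cosine_similarity`, `retrieve_top_k`, `toy_query`, and `toy_database` are hypothetical. The retrieved images would then serve as conditioning inputs to the multiview diffusion model, which is not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k database images most similar to the query.

    Assumption: the query (text prompt) and the images are embedded in the same
    space by some pretrained encoder; this sketch does not compute embeddings.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(image_embeddings)
    ]
    # Highest similarity first; sort is stable, so ties keep database order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data standing in for real embeddings (hypothetical values).
toy_query = [1.0, 0.0, 0.0]
toy_database = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [1.0, 0.0, 0.0],   # identical to the query
    [-1.0, 0.0, 0.0],  # opposite to the query
]

top = retrieve_top_k(toy_query, toy_database, k=2)
top_indices = [index for index, _ in top]
print(top_indices)
```

In a real system the exhaustive scan would typically be replaced by an approximate nearest-neighbor index, since the abstract describes the database as large.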

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper is very well written, organized, and easy to follow.
- The paper is well motivated, tackling the important problem of generating OOD or "rare" concepts that diffusion models were not sufficiently trained on, which therefore yield suboptimal training results when applied to 3D generation. This is a practical and prevalent problem in multiview diffusion models.
- The paper proposes a coherent and intuitive methodology that fits the problem at hand well: they start from the MVDream a

Weaknesses

- The paper could benefit from additional explanation of how it conducts semantic/geometric augmentation in a 3D training setting, and additional details on how much augmentation the model can tolerate. For example, if the semantically augmented retrieval images deviate too much from the source 3D asset, model training may degrade rather than learn 3D consistency. Please elaborate on this aspect of the method.
- One question that I have with this method is that 2D image gene

Reviewer 02 · Rating 4 · Confidence 2

Strengths

Quality & Originality: The primary strength of this paper lies in its exceptionally comprehensive and multi-faceted experimental evaluation, which leaves little room for doubt regarding the efficacy of the proposed MV-RAG framework. The experimental section is thorough and systematic, and the experiments are designed to systematically validate the method's performance across a wide range of scenarios and against numerous strong baselines. (1) The paper includes detailed ablation experiments on

Weaknesses

(1) Ablating the distinct roles of 2D mode and 3D mode (Section 4.3): The qualitative ablation in Figure 7 effectively illustrates the distinct roles of the 2D and 3D training modes. The accompanying text (lines 458-461) states that the 2D mode is crucial for separating objects from in-the-wild backgrounds, while the 3D mode is vital for correct shape rendering and background consistency. However, the described effects, specifically those concerning background separation and shape correctness c

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The application of retrieval-augmented generation to 2D–3D diffusion is novel and well-motivated. The proposed hybrid 2D–3D training scheme is technically sound and effectively integrates structured and unstructured supervision. The paper is thorough and well-organized, featuring extensive ablation studies, user evaluations, and clear architectural details.

Weaknesses

Although the qualitative results are strong, the quantitative improvements on standard metrics (e.g., PSNR, IS) are modest: noticeable but not substantial. The evaluation could benefit from comparisons with more diverse and standardized baselines such as Wonder3D or SyncDreamer.

Code & Models

Models

yosepyossi/mvrag
model · 14 dl · ♡ 1

Datasets

yosepyossi/OOD-Eval
dataset · 8 dl
