Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyung Kim, Seungryong Kim, Jin-Hwa Kim

TL;DR
This paper presents a diffusion-based framework for aligned novel view synthesis that generates images and geometry jointly, using cross-modal attention instillation and geometry-aware conditioning to enable high-fidelity 3D scene reconstruction.
Contribution
It introduces a cross-modal attention instillation technique and proximity-based mesh conditioning for improved aligned view synthesis and geometry prediction.
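To make the cross-modal attention idea concrete, here is a minimal NumPy sketch of one plausible reading: the image branch computes its attention weights as usual, and the geometry branch reuses those weights to aggregate its own value features instead of computing its own. The function names, token counts, feature size, and the single-head, single-layer setup are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def image_attention_map(q_img, k_img):
    # Standard scaled dot-product attention weights from the image branch.
    d = q_img.shape[-1]
    return softmax(q_img @ k_img.T / np.sqrt(d), axis=-1)

def instilled_geometry_attention(attn_img, v_geo):
    # Assumption: the geometry branch skips its own query-key product and
    # aggregates its values with the image branch's attention weights.
    return attn_img @ v_geo

rng = np.random.default_rng(0)
n_tgt, n_ref, d = 4, 6, 8              # hypothetical token counts and feature size
q_img = rng.normal(size=(n_tgt, d))    # image-branch queries (target view)
k_img = rng.normal(size=(n_ref, d))    # image-branch keys (reference view)
v_img = rng.normal(size=(n_ref, d))    # image-branch values
v_geo = rng.normal(size=(n_ref, d))    # geometry-branch values

attn = image_attention_map(q_img, k_img)
out_img = attn @ v_img                                # image-branch output
out_geo = instilled_geometry_attention(attn, v_geo)   # geometry-branch output
```

Because both branches share the same weights, each target token pulls image and geometry features from the same reference locations, which is one way the two outputs could be kept aligned.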
Findings
Achieves high-fidelity extrapolative view synthesis on unseen scenes.
Produces geometrically aligned colored point clouds for 3D completion.
Delivers competitive reconstruction quality in interpolation settings.
Abstract
We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce…
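The warping half of the warping-and-inpainting formulation can be sketched as follows: colored 3D points recovered from a reference view are projected into the target camera, which yields a partial image and a mask of covered pixels; the uncovered pixels are what an inpainting model would have to fill. This is a generic pinhole-camera sketch under assumed conventions (`x_cam = R @ x_world + t`, nearest-pixel splatting, a simple z-buffer). The function name and parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def warp_points_to_view(points_world, colors, K, R, t, height, width):
    """Project colored 3D points into a target view.

    Returns a partial RGB image and a boolean mask of covered pixels.
    Assumes a pinhole camera with x_cam = R @ x_world + t.
    """
    cam = points_world @ R.T + t
    front = cam[:, 2] > 1e-6                 # keep points in front of the camera
    cam, colors = cam[front], colors[front]

    uvw = cam @ K.T                          # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    image = np.zeros((height, width, 3))
    depth = np.full((height, width), np.inf)
    # Simple z-buffer: the nearest point wins at each pixel.
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside, 2], colors[inside]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            image[vi, ui] = ci
    mask = np.isfinite(depth)                # True where a point landed
    return image, mask

# Toy example: two points on the same ray, one behind the camera, one off-screen.
pts = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [5.0, 0.0, 1.0]])
cols = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
img, mask = warp_points_to_view(pts, cols, K, np.eye(3), np.zeros(3), 3, 3)
holes = ~mask                                # pixels left for inpainting
```

In the toy example only the center pixel is covered, and it takes the color of the nearer of the two points on that ray; every other pixel remains a hole.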
Peer Reviews
Decision: ICLR 2026 Poster
- Novel and technically elegant contribution: The cross-modal attention instillation mechanism is conceptually clean and effectively bridges the gap between image synthesis and geometry completion. It’s a natural extension of diffusion-based correspondence learning and could influence future cross-domain conditioning designs.
- Comprehensive evaluation: The experiments cover both extrapolative and interpolative regimes, multiple datasets (Co3D, DTU, RealEstate10K), and ablation analyses (Table
- Limited exploration of generalization across domains and semantics: It remains unclear whether the approach scales to in-the-wild scenes (e.g., urban/street-level data) or semantic diversity (humans, animals, etc.).
- Dependence on pretrained geometry estimators (e.g., VGGT, Marigold): The framework relies heavily on the quality of external predictors. Errors in these priors can propagate, which partially undermines the claim of being “pose-free” or “self-contained.”
- Lack of detailed compa
- Joint diffusion for images and geometry with cross-modal attention instillation (MoAI) is a simple, original coupling; using image attention to guide geometry and vice versa is creative, and proximity-based mesh conditioning is a practical improvement over raw pointmaps.
- Empirical results are strong in extrapolative settings on DTU and RealEstate10K, with clear additive gains in ablations (pointmap → mesh → MoAI) and the ability to benefit from more input views at test time despite two-view
- Reliance on off-the-shelf geometry/pose predictors: The pipeline depends on VGGT/MASt3R-style predictors for both training supervision (pseudo GT) and inference conditioning. This creates a ceiling tied to those models’ biases and errors, and complicates fairness: improvements may partially reflect better use of VGGT rather than intrinsic advances. Please include: (i) a robustness study under degraded pointmaps/poses (noise, sparsity, biased scale), (ii) an alternative backbone (e.g., DUSt3R
* The paper is well structured and easy to follow.
* Experiments on the Co3D and DTU datasets demonstrate the effectiveness of the proposed method.
* The cross-modal attention instillation is interesting to me and has been shown to be effective via the ablation study.
* The integration of diffusion priors to help NVS has been extensively explored in recent works, such as diffusion-aided NeRF/3DGS reconstruction (Deceptive-NeRF/3DGS [Liu et al.]) and unified frameworks such as ReconX, ViewCrafter, and ZeroNVS. Moreover, the proximity-based mesh conditioning is quite standard. Although MoAI is interesting, it cannot carry the whole paper at a top conference.
* The paper compares primarily against classical or earlier NVS methods (e.g., NeRF variants), but
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Advanced Vision and Imaging
Methods: Inpainting · Diffusion
