Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak; Junho Kim; Sangdoo Yun; Dongyoon Han; Taekyung Kim; Seungryong Kim; Jin-Hwa Kim

arXiv:2506.11924·cs.CV·February 9, 2026

TL;DR

This paper presents a diffusion-based framework for aligned novel view synthesis that generates images and geometry jointly, using cross-modal attention instillation and geometry-aware conditioning to enable high-fidelity 3D scene reconstruction.

Contribution

The paper introduces cross-modal attention instillation and proximity-based mesh conditioning to improve aligned view synthesis and geometry prediction.

Findings

01. Achieves high-fidelity extrapolative view synthesis on unseen scenes.

02. Produces geometrically aligned colored point clouds for 3D completion.

03. Delivers competitive reconstruction quality in interpolation settings.

Abstract

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention instillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce…
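The attention-sharing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the token counts, the feature width, and the function names `attention_map` and `instilled_attention` are all assumptions made for illustration. The only point it demonstrates is that the geometry branch aggregates its own value features with the attention map computed by the image branch, so both outputs follow the same correspondences.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    """Scaled dot-product attention weights, shape (n_query, n_key)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def instilled_attention(img_q, img_k, img_v, geo_v):
    """Hypothetical helper: both branches reuse the image branch's map.

    The image branch computes its attention map as usual. The geometry
    branch does not compute its own map; it applies the image map to
    its own value features.
    """
    attn = attention_map(img_q, img_k)
    img_out = attn @ img_v
    geo_out = attn @ geo_v
    return img_out, geo_out, attn

# Toy sizes (assumed): 4 target-view tokens attend to 6 reference tokens.
rng = np.random.default_rng(0)
n_tgt, n_ref, d = 4, 6, 8
img_q = rng.normal(size=(n_tgt, d))
img_k = rng.normal(size=(n_ref, d))
img_v = rng.normal(size=(n_ref, d))
geo_v = rng.normal(size=(n_ref, d))

img_out, geo_out, attn = instilled_attention(img_q, img_k, img_v, geo_v)
```

Because a single map drives both outputs, each target token draws its image features and its geometry features from the same reference locations, which is the alignment property the abstract emphasizes. Which layers share maps and how tokens are formed are details this sketch does not attempt to reproduce.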

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Novel and technically elegant contribution: The cross-modal attention instillation mechanism is conceptually clean and effectively bridges the gap between image synthesis and geometry completion. It’s a natural extension of diffusion-based correspondence learning and could influence future cross-domain conditioning designs.
- Comprehensive evaluation: The experiments cover both extrapolative and interpolative regimes, multiple datasets (Co3D, DTU, RealEstate10K), and ablation analyses (Table

Weaknesses

- Limited exploration of generalization across domains and semantics: It remains unclear whether the approach scales to in-the-wild scenes (e.g., urban/street-level data) or semantic diversity (humans, animals, etc.).
- Dependence on pretrained geometry estimators (e.g., VGGT, Marigold): The framework relies heavily on the quality of external predictors. Errors in these priors can propagate, which partially undermines the claim of being “pose-free” or “self-contained.”
- Lack of detailed compa

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Joint diffusion for images and geometry with cross-modal attention instillation (MoAI) is a simple, original coupling; using image attention to guide geometry and vice versa is creative, and proximity-based mesh conditioning is a practical improvement over raw pointmaps.
- Empirical results are strong in extrapolative settings on DTU and RealEstate10K, with clear additive gains in ablations (pointmap → mesh → MoAI) and the ability to benefit from more input views at test time despite two-view

Weaknesses

- Reliance on off-the-shelf geometry/pose predictors: The pipeline depends on VGGT/MASt3R-style predictors for both training supervision (pseudo GT) and inference conditioning. This creates a ceiling tied to those models’ biases and errors, and complicates fairness: improvements may partially reflect better use of VGGT rather than intrinsic advances. Please include: (i) a robustness study under degraded pointmaps/poses (noise, sparsity, biased scale), (ii) an alternative backbone (e.g., DUSt3R

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The paper is well structured and easy to follow.
* Experiments on Co3D and DTU datasets demonstrate the effectiveness of the proposed method.
* The cross-modal attention instillation is interesting to me and has been shown to be effective via the ablation study.

Weaknesses

* The integration of diffusion priors to help NVS has been extensively explored in recent works, like diffusion-aided NeRF/3DGS reconstruction (Deceptive-NeRF/3DGS [Liu et al.]) and unified frameworks such as ReconX, ViewCrafter, and ZeroNVS. Moreover, the proximity-based mesh conditioning is quite standard. Although MoAI is interesting, it cannot by itself support the whole paper for a top conference.
* The paper compares primarily against classical or earlier NVS methods (e.g., NeRF variants), but

Code & Models

Models

🤗 minseop-kwak/moai-checkpoints (model)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Advanced Vision and Imaging

Methods: Inpainting · Diffusion