A Single Image and Multimodality Is All You Need for Novel View Synthesis

Amirhosein Javadi; Chi-Shiang Gau; Konstantinos D. Polyzos; Tara Javidi

arXiv:2602.17909 · cs.CV · April 20, 2026

TL;DR

This paper improves single-image novel view synthesis by reconstructing dense depth from sparse range measurements, such as automotive radar or LiDAR, and using it in place of monocular depth estimates to condition diffusion models, which improves geometric consistency and visual quality.

Contribution

The paper introduces a multimodal depth reconstruction framework that uses Gaussian Processes to turn sparse range data into dense depth maps, improving view synthesis without modifying existing generative models.
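
To illustrate how a Gaussian Process can turn a handful of range samples into a dense depth map with a per-pixel uncertainty, here is a minimal NumPy sketch. It is not the authors' implementation. The squared-exponential kernel, the hyperparameter values, the use of normalized 2D image coordinates, the synthetic ramp scene, and the function names `rbf_kernel` and `gp_dense_depth` are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale, variance):
    """Squared-exponential kernel between point sets a (n, 2) and b (m, 2)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_dense_depth(coords, depths, query,
                   length_scale=0.2, variance=1.0, noise=1e-3):
    """GP posterior mean and variance of depth at the query locations.

    coords: (n, 2) sparse sample locations (assumed normalized to [0, 1]).
    depths: (n,) measured ranges at those locations.
    query:  (m, 2) locations where dense depth is wanted.
    """
    mean_depth = depths.mean()  # constant prior mean (an assumption)
    K = rbf_kernel(coords, coords, length_scale, variance)
    K += noise * np.eye(len(coords))  # measurement noise on the diagonal
    K_s = rbf_kernel(query, coords, length_scale, variance)

    # Cholesky-based solve for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, depths - mean_depth))
    mu = mean_depth + K_s @ alpha

    v = np.linalg.solve(L, K_s.T)
    var = variance - np.sum(v**2, axis=0)
    return mu, np.maximum(var, 0.0)  # clamp tiny negative round-off


# Toy scene: 40 sparse returns sampled from a planar depth ramp.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(40, 2))
depths = 5.0 + 2.0 * coords[:, 0]

# Dense 16x16 query grid standing in for image pixels.
gx, gy = np.meshgrid(np.linspace(0, 1, 16), np.linspace(0, 1, 16))
query = np.stack([gx.ravel(), gy.ravel()], axis=1)

mu, var = gp_dense_depth(coords, depths, query)
dense_depth = mu.reshape(16, 16)      # dense depth map
depth_uncertainty = var.reshape(16, 16)  # higher far from any sample
```

In a pipeline like the one described, `dense_depth` would take the place of the monocular depth map as geometric conditioning, while the generative model itself stays unchanged. The variance map is one possible way to flag regions with no nearby range samples.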

Findings

01 Replacing monocular depth with sparse range-based depth improves visual quality.

02 The approach enhances geometric consistency in novel view generation.

03 Sparse multimodal data significantly benefits diffusion-based view synthesis.

Abstract

Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low-texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular…

Code & Models

Repositories

importAmir/MultiModalNVS (GitHub)