DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Qitao Zhao; Amy Lin; Jeff Tan; Jason Y. Zhang; Deva Ramanan; Shubham Tulsiani

arXiv:2505.05473 · cs.CV · May 9, 2025

TL;DR

DiffusionSfM is a diffusion-based approach that directly infers 3D scene geometry and camera poses from multi-view images, outperforming classical and learning-based SfM baselines while modeling uncertainty.

Contribution

The paper presents a transformer-based denoising diffusion model that predicts scene geometry and camera poses directly from multi-view images, replacing the usual two-stage pipeline of pairwise reasoning followed by global optimization.

Findings

01. Outperforms classical SfM methods on both synthetic and real datasets.

02. Outperforms existing learning-based approaches on the same benchmarks.

03. Models uncertainty in 3D reconstruction and camera pose estimation.

Abstract

Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally…
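To make the ray origin and endpoint parameterization concrete, here is a minimal sketch of how one pixel could be mapped to such a pair. It assumes a pinhole camera, a world-to-camera pose convention (x_cam = R x_world + t), and z-depth; the function names `transpose_mul` and `pixel_ray` are hypothetical, and none of these conventions are confirmed by the abstract.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transpose_mul(R: Sequence[Sequence[float]], v: Sequence[float]) -> Vec3:
    """Return R^T v for a 3x3 matrix R and a 3-vector v."""
    return tuple(sum(R[k][i] * v[k] for k in range(3)) for i in range(3))


def pixel_ray(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]], t: Sequence[float],
) -> Tuple[Vec3, Vec3]:
    """Ray origin and endpoint for one pixel, expressed in the world frame.

    Assumed conventions (not taken from the paper): x_cam = R @ x_world + t,
    and `depth` is the z-coordinate of the surface point in the camera frame.
    """
    # Origin: the camera center in world coordinates, c = -R^T t.
    origin = transpose_mul(R, [-t[0], -t[1], -t[2]])
    # Back-project the pixel to a 3D point in the camera frame.
    x_cam = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    # Endpoint: the same point in the world frame, R^T (x_cam - t).
    endpoint = transpose_mul(R, [x_cam[i] - t[i] for i in range(3)])
    return origin, endpoint


# Example: identity rotation, camera center at (0, 0, 2) in the world frame.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, endpoint = pixel_ray(
    u=320.0, v=240.0, depth=3.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    R=R_identity, t=[0.0, 0.0, -2.0],
)
```

Under these assumptions every pixel of an image shares the same origin (the camera center), while the endpoints trace out the visible surface, so a single per-pixel map carries both camera and geometry information.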

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion