Visual Geometry Grounded Deep Structure From Motion
Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny

TL;DR
This paper introduces VGGSfM, a fully differentiable deep learning pipeline for structure-from-motion (SfM). It is trained end-to-end, builds on deep 2D point tracking instead of pairwise keypoint matching, and recovers all camera poses and 3D points jointly, which improves camera pose and 3D reconstruction accuracy.
Contribution
The paper presents an end-to-end differentiable SfM pipeline that replaces the non-differentiable steps of the classical pipeline with learnable modules, enabling more accurate and efficient 3D reconstruction.
Findings
Achieves state-of-the-art results on CO3D, IMC Phototourism, and ETH3D datasets.
Eliminates the need for pairwise keypoint matching through pixel-accurate tracking.
Simultaneously recovers all camera poses and triangulates 3D points in a differentiable framework.
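To make the triangulation step in the findings above concrete, here is a minimal sketch of classical two-view midpoint triangulation. It is not the paper's learned triangulator. The function names (`project`, `pixel_to_ray`, `triangulate_midpoint`), the intrinsics, and the assumption that both cameras are axis-aligned (identity rotation) are hypothetical choices for illustration only. The point of the sketch is that triangulation reduces to closed-form arithmetic, which is the kind of operation an autodiff framework can differentiate through.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical pinhole intrinsics used only for this illustration.
FOCAL, CX, CY = 500.0, 320.0, 240.0


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _scale(a: Vec3, s: float) -> Vec3:
    return (a[0] * s, a[1] * s, a[2] * s)


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def project(point: Vec3, center: Vec3) -> Tuple[float, float]:
    """Project a world point into an axis-aligned camera located at `center`."""
    x, y, z = _sub(point, center)
    return (FOCAL * x / z + CX, FOCAL * y / z + CY)


def pixel_to_ray(u: float, v: float) -> Vec3:
    """Back-project a pixel to an (unnormalized) viewing-ray direction."""
    return ((u - CX) / FOCAL, (v - CY) / FOCAL, 1.0)


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = _sub(c1, c2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b  # zero when the two rays are parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = _add(c1, _scale(d1, s))  # closest point on ray 1
    p2 = _add(c2, _scale(d2, t))  # closest point on ray 2
    return _scale(_add(p1, p2), 0.5)


if __name__ == "__main__":
    point = (0.5, -0.2, 4.0)
    cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    ray1 = pixel_to_ray(*project(point, cam1))
    ray2 = pixel_to_ray(*project(point, cam2))
    print(triangulate_midpoint(cam1, ray1, cam2, ray2))
```

With noise-free observations the two rays intersect, so the midpoint coincides with the original 3D point. With noisy tracks the rays miss each other, and the midpoint is only a starting estimate that a bundle adjustment step would refine.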
Abstract
Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable…
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage
Methods: Sparse Evolutionary Training
