Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Qiwei Wang, Zhongyao Tuo, Xianghui Ze, Yujiao Shi

TL;DR
This paper introduces Cross3R, a feedforward model that uses satellite, UAV, and ground images to reconstruct 3D scenes and estimate camera poses, overcoming limitations of traditional 3-DoF localization.
Contribution
The paper presents Cross3R, a feed-forward model that jointly processes satellite, UAV, and ground images for 3D reconstruction and camera pose estimation, without requiring known relative poses between the views.
Findings
Cross3R outperforms existing feed-forward baselines in 3D reconstruction and localization.
Cross3R surpasses dedicated cross-view methods on KITTI without training on it.
The CrossGeo dataset contains 278K images across 85 diverse scenes.
Abstract
Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- a 2D position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image,…
