C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction
Kuan Wei Huang, Brandon Li, Bharath Hariharan, Noah Snavely

TL;DR
This paper introduces C3, a new large-scale dataset for cross-view, cross-modality correspondence between ground photos and floor plans, and proposes a method to improve correspondence prediction in challenging scenarios.
Contribution
The paper presents a novel dataset, C3, with 90K paired photos and floor plans, and demonstrates that training models on this data improves correspondence prediction.
Findings
State-of-the-art models perform poorly on cross-modal correspondence tasks.
Training on the C3 dataset reduces RMSE by 34%.
Predicted correspondences enable camera pose estimation.
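To make the RMSE figure concrete, here is a minimal sketch of how a root-mean-square error over point correspondences and a relative reduction could be computed. The function names and the choice of Euclidean distance in floor-plan pixel coordinates are assumptions for illustration; the summary does not state the paper's exact evaluation protocol.

```python
import numpy as np

def correspondence_rmse(pred_xy, gt_xy):
    """RMSE between predicted and ground-truth 2D locations.

    Assumption: both inputs are (N, 2) arrays of floor-plan coordinates,
    one row per photo pixel being matched.
    """
    pred = np.asarray(pred_xy, dtype=float)
    gt = np.asarray(gt_xy, dtype=float)
    if pred.shape != gt.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("expected two arrays of identical shape (N, 2)")
    # Squared Euclidean distance per correspondence, then mean and root.
    sq_dist = np.sum((pred - gt) ** 2, axis=1)
    return float(np.sqrt(sq_dist.mean()))

def relative_reduction(baseline_rmse, new_rmse):
    """Fractional RMSE reduction relative to a baseline (0.34 means 34%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse
```

Under this definition, a 34% reduction means the new RMSE is 0.66 times the baseline RMSE.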
Abstract
Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo-floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive…
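As an illustration of why registering a 3D reconstruction to a floor plan is useful, the sketch below projects the same 3D points into a photo (pinhole camera) and onto a plan (2D similarity transform), which pairs photo pixels with plan locations. This is a generic geometric sketch under stated assumptions, not the authors' pipeline: the y-up axis convention, the similarity-transform registration, and all function and parameter names are hypothetical.

```python
import numpy as np

def project_to_image(points_world, K, R, t):
    """Pinhole projection x ~ K (R X + t); returns pixels and a visibility mask.

    points_world: (N, 3), K: (3, 3) intrinsics, R: (3, 3), t: (3,).
    Points behind the camera get NaN pixel coordinates.
    """
    cam = points_world @ R.T + t
    in_front = cam[:, 2] > 0
    uvw = cam @ K.T
    uv = np.full((len(points_world), 2), np.nan)
    uv[in_front] = uvw[in_front, :2] / uvw[in_front, 2:3]
    return uv, in_front

def project_to_plan(points_world, scale, theta, offset):
    """Map 3D points to floor-plan coordinates.

    Assumption: the y axis is 'up', so the ground plane is spanned by (x, z),
    and the registration is a 2D similarity (scale, rotation, translation).
    """
    ground = points_world[:, [0, 2]]
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return scale * (ground @ rot.T) + np.asarray(offset, dtype=float)

def photo_plan_pairs(points_world, K, R, t, scale, theta, offset):
    """Pair each visible point's photo pixel with its floor-plan location."""
    uv, visible = project_to_image(points_world, K, R, t)
    plan_xy = project_to_plan(points_world, scale, theta, offset)
    return uv[visible], plan_xy[visible]
```

A real pipeline would also need occlusion handling and image-bounds checks, which this sketch omits.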
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
