C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Kuan Wei Huang; Brandon Li; Bharath Hariharan; Noah Snavely

arXiv:2511.18559·cs.CV·January 13, 2026

TL;DR

This paper introduces C3, a new large-scale dataset for cross-view, cross-modality correspondence between ground photos and floor plans, and proposes a method to improve correspondence prediction in challenging scenarios.

Contribution

The paper presents a novel dataset, C3, with 90K paired photos and floor plans, and demonstrates improved correspondence prediction by training models on this data.

Findings

01. State-of-the-art models perform poorly on cross-modal correspondence tasks.

02. Training on the C3 dataset improves RMSE by 34%.

03. Predicted correspondences enable camera pose estimation.
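
To make the RMSE finding concrete, here is a minimal sketch of how a root-mean-square error over point correspondences and a relative improvement between two models could be computed. The function names, the use of 2D Euclidean error, and the sample coordinates are all assumptions for illustration. The paper's exact evaluation protocol may differ, and the numbers below are made up rather than taken from its results.

```python
import math


def correspondence_rmse(predicted, ground_truth):
    """RMSE of Euclidean distances between matched 2D points.

    Hypothetical helper: both arguments are equal-length lists of (x, y)
    tuples, e.g. predicted and true floor-plan locations for photo pixels.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (px - gx) ** 2 + (py - gy) ** 2
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to a baseline model."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Illustrative values only (not from the paper).
truth = [(0.0, 0.0), (10.0, 0.0)]
baseline_pred = [(3.0, 4.0), (10.0, 5.0)]  # each point is off by 5 units
trained_pred = [(0.0, 3.0), (10.0, 3.0)]   # each point is off by 3 units

baseline_rmse = correspondence_rmse(baseline_pred, truth)
trained_rmse = correspondence_rmse(trained_pred, truth)
improvement = relative_improvement(baseline_rmse, trained_rmse)
print(f"baseline={baseline_rmse:.2f} trained={trained_rmse:.2f} "
      f"improvement={improvement:.0%}")
```

Under this reading, a reported 34% improvement would correspond to `relative_improvement` returning 0.34 when comparing the model trained on C3 against a baseline.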

Abstract

Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo-floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive…

Code & Models

Datasets

kwhuang/C3
dataset· 352 dl

Videos

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction· slideslive

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques