Self-Supervised Spatial Correspondence Across Modalities
Ayush Shrivastava, Andrew Owens

TL;DR
This paper introduces a self-supervised method for finding spatial correspondences across different visual modalities, such as RGB, depth, and thermal images, without requiring labeled data or aligned pairs.
Contribution
It extends the contrastive random walk framework to learn cycle-consistent features for cross-modal and intra-modal matching in a fully unsupervised manner.
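As a rough illustration of the cycle-consistency idea behind a contrastive random walk, the NumPy sketch below walks from the nodes of one image to another and back, then penalizes walks that do not return to their starting node. It is a generic sketch, not the authors' implementation: the function names, the temperature value of 0.07, and the use of a single two-step cycle are all assumptions, and it does not show how the paper combines cross-modal and intra-modal walks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transition(feat_a, feat_b, temperature=0.07):
    """Row-stochastic transition matrix from nodes of A to nodes of B.

    feat_a: (N, C) features, feat_b: (M, C) features.
    The temperature is a hypothetical default, not taken from the paper.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return softmax(a @ b.T / temperature, axis=1)

def cycle_loss(feat_a, feat_b, temperature=0.07):
    """Cross-entropy between the round trip A -> B -> A and the identity."""
    round_trip = transition(feat_a, feat_b, temperature) @ transition(
        feat_b, feat_a, temperature
    )
    n = feat_a.shape[0]
    # Probability that each walker returns to the node it started from.
    return_prob = round_trip[np.arange(n), np.arange(n)]
    return -np.mean(np.log(return_prob + 1e-12))

# Distinct, perfectly matched features give a near-zero loss;
# uninformative (constant) features give log(N).
loss_matched = cycle_loss(np.eye(4), np.eye(4))
loss_uninformative = cycle_loss(np.eye(4), np.ones((4, 4)))
```

In this toy setup the loss needs no labels: the only supervision is that a walk should end where it began, which is what lets such an objective be trained on unlabeled data.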
Findings
The method performs strongly on both geometric and semantic correspondence benchmarks.
It is effective on challenging cross-modal matching tasks such as RGB-to-depth and RGB-to-thermal.
Training requires no spatially aligned multimodal image pairs.
Abstract
We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and…
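One common way to read pixel correspondences out of learned features is nearest-neighbor matching under cosine similarity. The sketch below assumes that readout and assumes dense per-pixel feature maps of shape (H, W, C); the abstract does not say how the authors extract matches, and `match_pixels` is a hypothetical helper.

```python
import numpy as np

def match_pixels(feat_src, feat_tgt):
    """Match every source pixel to its most similar target pixel.

    feat_src: (H, W, C) feature map of the source image (e.g. RGB).
    feat_tgt: (Ht, Wt, C) feature map of the target image (e.g. depth).
    Returns an (H, W, 2) integer array of (row, col) target coordinates.
    """
    h, w, c = feat_src.shape
    ht, wt, _ = feat_tgt.shape
    src = feat_src.reshape(-1, c)
    tgt = feat_tgt.reshape(-1, c)
    # L2-normalize so the dot product is cosine similarity.
    src = src / (np.linalg.norm(src, axis=1, keepdims=True) + 1e-12)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-12)
    best = np.argmax(src @ tgt.T, axis=1)
    rows, cols = np.unravel_index(best, (ht, wt))
    return np.stack([rows, cols], axis=-1).reshape(h, w, 2)

# Toy check: the target is the source shifted one pixel to the right
# (with wrap-around), so each match should land one column over.
rng = np.random.default_rng(0)
feat_src = rng.normal(size=(3, 4, 16))
feat_tgt = np.roll(feat_src, 1, axis=1)
matches = match_pixels(feat_src, feat_tgt)
```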
Taxonomy
Topics: Geographic Information Systems Studies
