Cross-View Completion Models are Zero-shot Correspondence Estimators
Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, Seungryong Kim

TL;DR
This paper shows that the cross-attention maps inside cross-view completion models can be used to estimate correspondences in a zero-shot setting, and that the same maps also help learning-based geometric matching and multi-frame depth estimation.
Contribution
The paper demonstrates that cross-attention maps in cross-view completion models capture correspondence more effectively than correlations derived from encoder or decoder features, and applies this to zero-shot correspondence estimation.
Findings
Cross-attention maps outperform other features in correspondence tasks.
The method achieves state-of-the-art results in zero-shot matching.
Effective in geometric matching and multi-frame depth estimation.
Abstract
In this work, we explore new perspectives on cross-view completion learning by drawing an analogy to self-supervised correspondence learning. Through our analysis, we demonstrate that the cross-attention map within cross-view completion models captures correspondence more effectively than other correlations derived from encoder or decoder features. We verify the effectiveness of the cross-attention map by evaluating on both zero-shot matching and learning-based geometric matching and multi-frame depth estimation. Project page is available at https://cvlab-kaist.github.io/ZeroCo/.
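To illustrate what reading correspondence from a cross-attention map can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy token grid, and the hard argmax readout are assumptions made for illustration. Target-view tokens act as queries, source-view tokens act as keys, and each target token is matched to the source token that receives the most attention.

```python
import numpy as np

def cross_attention_map(q, k):
    """Softmax attention of target queries q (Nt, d) over source keys k (Ns, d)."""
    logits = q @ k.T / np.sqrt(q.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_to_matches(attn, src_hw):
    """For each target token, return the (row, col) of the most-attended source token."""
    idx = attn.argmax(axis=1)
    return np.stack(np.unravel_index(idx, src_hw), axis=1)

# Toy data (hypothetical): target tokens are a shuffled copy of the source tokens,
# so the ground-truth correspondence is the permutation itself.
rng = np.random.default_rng(0)
h, w, d = 4, 4, 64
src = rng.normal(size=(h * w, d))
perm = rng.permutation(h * w)
tgt = src[perm]

attn = cross_attention_map(tgt, src)
matches = attention_to_matches(attn, (h, w))
```

In a real cross-view completion model the queries and keys would come from a decoder cross-attention layer rather than from random features, and a soft readout (an attention-weighted average of source coordinates) could replace the hard argmax when sub-token precision matters.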
Taxonomy
Topics: Statistical Methods and Inference
