Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan; Leqi Zheng; Keyu Lv; Jingchen Ni; Hongyang Wei; Jiajun Zhang; Guangting Wang; Jing Lyu; Chun Yuan; Fengyun Rao

arXiv:2602.18996·cs.CV·March 26, 2026

TL;DR

This paper presents a cycle-consistent mask prediction framework for establishing object-level correspondence between egocentric and exocentric video views. Because the cycle-consistency objective is self-supervised, it also supports test-time training, and the method reports state-of-the-art results on Ego-Exo4D and HANDAL-X.

Contribution

The paper introduces a cycle-consistency training objective that learns view-invariant object correspondence without ground-truth annotations, which in turn makes test-time training possible.

Findings

01. Achieves state-of-the-art results on the Ego-Exo4D and HANDAL-X benchmarks.

02. Cycle-consistency training improves robustness and view-invariance.

03. Test-time training enhances performance during inference.

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of…
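The cycle-consistency objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the binary cross-entropy loss, the function names, and the toy `flip_view` predictor are all assumptions made for illustration, and the real model would also condition on the source and target video frames rather than on the mask alone.

```python
import math
from typing import Callable, List

Mask = List[List[float]]  # soft mask with values in [0, 1]


def binary_cross_entropy(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean per-pixel binary cross-entropy (assumed loss, not from the paper)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def cycle_consistency_loss(
    query_mask: Mask,
    predict_src_to_tgt: Callable[[Mask], Mask],
    predict_tgt_to_src: Callable[[Mask], Mask],
) -> float:
    """Predict the mask in the target view, project it back to the source
    view, and penalize the mismatch with the original query mask."""
    target_mask = predict_src_to_tgt(query_mask)       # source -> target
    reconstructed = predict_tgt_to_src(target_mask)    # target -> source
    return binary_cross_entropy(reconstructed, query_mask)


def flip_view(mask: Mask) -> Mask:
    """Hypothetical stand-in for a learned predictor: a horizontal flip
    plays the role of the viewpoint change in this toy example."""
    return [row[::-1] for row in mask]


query = [[1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0]]

# A consistent pair of predictors reconstructs the query, so the loss is near zero.
consistent_loss = cycle_consistency_loss(query, flip_view, flip_view)

# An inconsistent pair (no flip on the way back) yields a much larger loss.
inconsistent_loss = cycle_consistency_loss(query, flip_view, lambda m: [r[:] for r in m])
```

Because the loss compares the reconstruction only with the query mask itself, no ground-truth mask in the target view is needed, which is why the same objective can in principle be minimized on unlabeled test videos.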

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning