ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives
Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool

TL;DR
ObjectRelator is a novel method that improves cross-view object segmentation by integrating language cues and self-supervised alignment, significantly advancing understanding of object relations across ego-centric and exo-centric perspectives.
Contribution
It introduces two modules, MCFuse and XObjAlign, that incorporate language and self-supervision to enhance cross-view object segmentation accuracy.
Findings
Achieves state-of-the-art results on Ego-Exo4D and HANDAL-X datasets.
Effectively handles complex backgrounds and significant viewpoint changes.
Demonstrates robustness to object appearance variations.
Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image and Video Retrieval Techniques · Human Pose and Action Recognition · Advanced Vision and Imaging
MethodsFocus
