ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives

Yuqian Fu; Runze Wang; Bin Ren; Guolei Sun; Biao Gong; Yanwei Fu; Danda Pani Paudel; Xuanjing Huang; Luc Van Gool

arXiv:2411.19083 · cs.CV · July 28, 2025

TL;DR

ObjectRelator is a novel method that improves cross-view object segmentation by integrating language cues and self-supervised alignment, significantly advancing understanding of object relations across ego-centric and exo-centric perspectives.

Contribution

ObjectRelator introduces two modules, MCFuse and XObjAlign, which incorporate language cues and self-supervised alignment to improve cross-view object segmentation accuracy.

Findings

01. Achieves state-of-the-art results on Ego-Exo4D and HANDAL-X datasets.
02. Effectively handles complex backgrounds and significant viewpoint changes.
03. Demonstrates robustness to object appearance variations.

Abstract

Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition · Advanced Vision and Imaging