RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
Shuhei Kurita, Naoki Katsura, Eri Onami

TL;DR
RefEgo is a new large-scale dataset derived from egocentric videos in Ego4D, designed to advance the understanding of referring expressions in first-person perspectives for real-world applications.
Contribution
The paper introduces RefEgo, a comprehensive video-based referring expression dataset from egocentric views, filling a gap in real-world, first-person perception data for grounding language in visual scenes.
Findings
Combines state-of-the-art 2D referring expression comprehension models with an object tracking algorithm to track the referred object across a video
Tracks the referred object even when it goes out of frame mid-video or when multiple similar objects appear in the video
Demonstrates the dataset's potential for advancing first-person referring expression comprehension
Abstract
Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and follow intuitive text instructions. Such a capability is necessary for glass devices or autonomous robots to localize referred objects in the real world. In conventional referring expression comprehension tasks on images, however, datasets are mostly constructed from web-crawled data and do not reflect the diverse real-world structures involved in grounding textual expressions to diverse objects in the real world. Recently, Ego4D, a massive-scale egocentric video dataset, was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on egocentric videos of Ego4D, we constructed…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
