RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

Shuhei Kurita; Naoki Katsura; Eri Onami

arXiv:2308.12035 · cs.CV · October 31, 2023 · 2 cites

TL;DR

RefEgo is a new large-scale dataset derived from egocentric videos in Ego4D, designed to advance the understanding of referring expressions in first-person perspectives for real-world applications.

Contribution

The paper introduces RefEgo, a comprehensive video-based referring expression dataset from egocentric views, filling a gap in real-world, first-person perception data for grounding language in visual scenes.

Findings

01. Combining state-of-the-art referring expression comprehension models with object tracking improves performance.

02. The combined approach tracks the referred object even when it goes out of frame or when multiple similar objects are present.

03. The dataset shows potential for advancing first-person referring expression comprehension.
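
The findings describe pairing a per-frame referring expression comprehension model with object tracking. The sketch below illustrates one simple way such a pairing could work. It is not the authors' pipeline: the function names, the `min_score` threshold, and the `iou_weight` bonus are all hypothetical, and the per-frame candidates are assumed to come from some grounding model that returns scored boxes.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_referred_object(
    frames: List[List[Tuple[Box, float]]],
    min_score: float = 0.5,   # hypothetical confidence threshold
    iou_weight: float = 0.5,  # hypothetical bonus for temporal consistency
) -> List[Optional[Box]]:
    """Pick at most one box per frame.

    frames[t] holds (box, score) candidates from an assumed per-frame
    grounding model. A frame yields None when no candidate reaches
    min_score, which stands in for "the referred object is out of frame".
    """
    track: List[Optional[Box]] = []
    last: Optional[Box] = None  # most recent accepted box
    for candidates in frames:
        best_box: Optional[Box] = None
        best_value = float("-inf")
        for box, score in candidates:
            if score < min_score:
                continue
            # Favor candidates that overlap the last accepted box, so a
            # similar-looking distractor elsewhere is less likely to win.
            bonus = iou_weight * iou(box, last) if last is not None else 0.0
            if score + bonus > best_value:
                best_box, best_value = box, score + bonus
        track.append(best_box)
        if best_box is not None:
            last = best_box
    return track
```

In this sketch a frame with no confident candidate produces `None` rather than a forced box, and the overlap bonus lets a slightly lower-scoring box win when it continues the existing track. A real system would likely use a learned tracker rather than this greedy linking.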

Abstract

Grounding textual expressions on scene objects from first-person views is a demanding capability for developing agents that are aware of their surroundings and follow intuitive text instructions. Such a capability is necessary for glass devices and autonomous robots to localize referred objects in the real world. In conventional image-based referring expression comprehension tasks, however, datasets are mostly constructed from web-crawled data and do not reflect the diversity of real-world scenes and objects in which textual expressions must be grounded. Recently, Ego4D, a massive-scale egocentric video dataset, was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, manufacturing, etc. Based on egocentric videos of Ego4D, we constructed…

Code & Models

Repositories

shuheikurita/refego (PyTorch, official)

Videos

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Human Pose and Action Recognition

Methods: Attentive Walk-Aggregating Graph Neural Network