Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Zi-Yi Dou; Xitong Yang; Tushar Nagarajan; Huiyu Wang; Jing Huang; Nanyun Peng; Kris Kitani; Fu-Jen Chu

arXiv:2408.03567·cs.CV·August 8, 2024

TL;DR

This paper introduces EMBED, a method that adapts exocentric video-language data for egocentric training, improving egocentric video representation learning and achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper presents a novel data transformation framework that adapts exocentric video-language data for egocentric learning, addressing disparities between the two data types.

Findings

01 Achieved a 4.7% improvement on Epic-Kitchens-100 retrieval

02 Achieved a 6.2% improvement on EGTEA classification in the zero-shot setting

03 Demonstrated strong generalization across various exocentric datasets

Abstract

We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities. Additionally, narratives in egocentric datasets are typically more action-centric and closely linked with the visual content, in contrast to the narrative styles found in exocentric datasets. To address these challenges, we employ a data transformation framework to adapt exocentric data for egocentric training, focusing on identifying…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: ALIGN