Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
Zi-Yi Dou, Xitong Yang, Tushar Nagarajan, Huiyu Wang, Jing Huang, Nanyun Peng, Kris Kitani, Fu-Jen Chu

TL;DR
This paper introduces EMBED, a method that transforms exocentric video-language data into egocentric data, enabling improved egocentric video representation learning and achieving state-of-the-art results on multiple benchmarks.
Contribution
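The paper presents a data transformation framework that adapts exocentric video-language data for egocentric training. It targets two disparities between the data types: egocentric videos focus on close-up hand-object interactions while exocentric videos show a broader view of human activity, and egocentric narrations are more action-centric and more closely tied to the visual content than exocentric ones.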
Findings
Achieved a 4.7% improvement on Epic-Kitchens-100 retrieval
Achieved a 6.2% improvement on EGTEA classification in the zero-shot setting
Demonstrated strong generalization across various exocentric datasets
Abstract
We present EMBED (Egocentric Models Built with Exocentric Data), a method designed to transform exocentric video-language data for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities. Additionally, narratives in egocentric datasets are typically more action-centric and closely linked with the visual content, in contrast to the narrative styles found in exocentric datasets. To address these challenges, we employ a data transformation framework to adapt exocentric data for egocentric training, focusing on identifying…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: ALIGN
