Egocentric Video-Language Pretraining
Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray,, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie, Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike, Zheng Shou

TL;DR
This paper pioneers egocentric video-language pretraining by creating a large-scale dataset, proposing a novel contrastive learning method, and establishing a benchmark, leading to improved performance on various egocentric tasks.
Contribution
It introduces EgoClip, EgoNCE, and EgoMCQ, enabling effective egocentric video-language representation learning and evaluation.
Findings
Strong performance on five egocentric downstream tasks
Effective validation and exploration using EgoMCQ benchmark
Demonstrated the effectiveness of EgoNCE contrastive learning
Abstract
Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a 1st-person video-text pretraining dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a large variety of human daily activities. (ii) We propose a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples. (iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and hence can support effective validation and fast exploration of our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research
MethodsContrastive Learning
