Egocentric Video-Language Pretraining

Kevin Qinghong Lin; Alex Jinpeng Wang; Mattia Soldan; Michael Wray; Rui Yan; Eric Zhongcong Xu; Difei Gao; Rongcheng Tu; Wenzhe Zhao; Weijie Kong; Chengfei Cai; Hongfa Wang; Dima Damen; Bernard Ghanem; Wei Liu; Mike Zheng Shou

arXiv:2206.01670 · cs.CV · October 14, 2022 · 45 cites

TL;DR

This paper pioneers egocentric video-language pretraining by creating a large-scale dataset, proposing a novel contrastive learning method, and establishing a benchmark, leading to improved performance on various egocentric tasks.

Contribution

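It introduces EgoClip (a 1st-person video-text pretraining dataset), EgoNCE (a contrastive pretraining objective adapted to the egocentric domain), and EgoMCQ (a development benchmark), which together enable egocentric video-language representation learning and evaluation.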

Findings

01. Strong performance on five egocentric downstream tasks

02. Effective validation and exploration using the EgoMCQ benchmark

03. Demonstrated effectiveness of EgoNCE contrastive learning

Abstract

Video-Language Pretraining (VLP), which aims to learn transferable representations to advance a wide range of video-text downstream tasks, has recently received increasing attention. The best-performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a 1st-person video-text pretraining dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a large variety of human daily activities. (ii) We propose a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples. (iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and hence can support effective validation and fast exploration of our…
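The abstract describes EgoNCE only at a high level: a video-text contrastive objective with mined positive and negative samples. As a rough illustration of that general idea, and not the paper's exact formulation, the sketch below computes a symmetric InfoNCE-style loss in which each clip may have more than one positive text. The function names, the `pos_mask` argument, and the temperature value are assumptions made for this example; how positives and negatives are actually mined is not specified in the abstract, so the mask is simply supplied by the caller.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multi_positive_nce(sim, pos_mask, temperature=0.05):
    """One-directional contrastive loss with possibly several positives per row.

    sim[i][j]      : similarity between video i and text j (hypothetical input)
    pos_mask[i][j] : True if text j counts as a positive for video i
    """
    total = 0.0
    for row, mask_row in zip(sim, pos_mask):
        logits = [s / temperature for s in row]
        pos_logits = [l for l, is_pos in zip(logits, mask_row) if is_pos]
        if not pos_logits:
            raise ValueError("every row needs at least one positive")
        # -log( sum over positives / sum over all candidates )
        total += _log_sum_exp(logits) - _log_sum_exp(pos_logits)
    return total / len(sim)


def symmetric_nce(sim, pos_mask, temperature=0.05):
    """Average of the video-to-text and text-to-video losses."""
    sim_t = [list(col) for col in zip(*sim)]
    mask_t = [list(col) for col in zip(*pos_mask)]
    v2t = multi_positive_nce(sim, pos_mask, temperature)
    t2v = multi_positive_nce(sim_t, mask_t, temperature)
    return 0.5 * (v2t + t2v)


# Toy batch of three clip-text pairs; clips 0 and 2 are assumed to be similar.
sim = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.9],
]
identity_mask = [[i == j for j in range(3)] for i in range(3)]
mined_mask = [row[:] for row in identity_mask]
mined_mask[0][2] = mined_mask[2][0] = True  # treat the similar pair as positives

loss_identity = symmetric_nce(sim, identity_mask)
loss_mined = symmetric_nce(sim, mined_mask)
```

With the identity mask this reduces to ordinary symmetric InfoNCE over the batch. Marking the similar pair as positives stops the loss from penalizing the model for scoring those two clips' texts close together, which is the kind of effect that mining domain-aware positives is meant to achieve.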

Code & Models

Datasets

AlanaAI/EVUD
dataset · 90 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Contrastive Learning