EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Shraman Pramanick; Yale Song; Sayan Nag; Kevin Qinghong Lin; Hardik Shah; Mike Zheng Shou; Rama Chellappa; and Pengchuan Zhang

arXiv:2307.05463 · cs.CV · August 22, 2023

TL;DR

EgoVLPv2 is an egocentric video-language pre-training framework that integrates cross-modal fusion directly into the video and language backbones. This makes fine-tuning more efficient and flexible, and yields state-of-the-art results across diverse video-language tasks.

Contribution

EgoVLPv2 incorporates lightweight cross-modal fusion into the video and language backbones, which strengthens pre-training and reduces fine-tuning costs compared with earlier egocentric frameworks that keep the two encoders separate.

Findings

01

Achieves state-of-the-art results on multiple video-language (VL) tasks.

02

Demonstrates an efficient fusion strategy with reduced computational overhead.

03

Supports diverse downstream tasks with a unified model.

Abstract

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and…
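
The idea of placing cross-modal attention inside a backbone layer, as the abstract describes, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `attend` and `backbone_layer`, the scalar `gate`, the absence of learned projections, and the example token values are all assumptions made for illustration. The sketch only shows how one layer could act as a plain unimodal layer when cross-attention is off and as a fused layer when it is on.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Toy single-head dot-product attention (no learned projections)."""
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        output.append([sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)])
    return output

def backbone_layer(tokens, other=None, gate=0.0):
    """One toy backbone layer: self-attention plus optional cross-attention.

    With other=None or gate == 0.0 the layer is unimodal. Otherwise the tokens
    also attend to the other modality, scaled by the (hypothetical) gate.
    """
    self_out = attend(tokens, tokens)
    hidden = [[a + b for a, b in zip(t, s)] for t, s in zip(tokens, self_out)]  # residual
    if other is None or gate == 0.0:
        return hidden
    cross_out = attend(hidden, other)
    return [[a + gate * b for a, b in zip(h, c)] for h, c in zip(hidden, cross_out)]

# Made-up 2-dimensional token embeddings for illustration only.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, -0.5], [2.0, 0.0]]

video_only = backbone_layer(video)                          # unimodal mode
video_fused = backbone_layer(video, other=text, gate=0.5)   # fusion mode
```

In this sketch the same layer serves both modes, which mirrors the abstract's point that cross-modal attention modules learned in pre-training can be reused downstream rather than added only at fine-tuning time.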

Code & Models

Repositories

facebookresearch/EgoVLPv2
PyTorch · Official

Videos

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition