EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang

TL;DR
EgoVLPv2 is an egocentric video-language pre-training framework that integrates cross-modal fusion directly into the video and language backbones, making downstream adaptation efficient and flexible while reaching state-of-the-art performance across diverse video-language tasks.
Contribution
EgoVLPv2 incorporates lightweight cross-modal fusion into the backbone, strengthening pre-training and reducing fine-tuning costs compared to previous methods.
Findings
Achieves state-of-the-art results on multiple video-language (VL) tasks.
Demonstrates an efficient fusion strategy with reduced computational overhead.
Supports diverse downstream tasks with a unified model.
Abstract
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and…
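The idea of fusing modalities inside the backbone and reusing the same cross-modal attention for different tasks can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `fuse` switch, the scalar `gate`, the identity unimodal path, and the projection-free single-head attention are all assumptions chosen to keep the example small. It only shows how one layer could act as a plain unimodal layer when fusion is off and mix in the other modality when fusion is on.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head dot-product attention with no learned projections (a simplification).

    Each query token attends over the tokens of the other modality.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def backbone_layer(tokens, other_tokens=None, gate=0.0, fuse=False):
    """Hypothetical backbone layer with an optional cross-modal residual.

    The unimodal path is the identity here; a real layer would apply
    self-attention and a feed-forward block. `gate` scales the fused signal.
    """
    if not fuse or other_tokens is None:
        return [list(t) for t in tokens]  # unimodal behaviour: tokens pass through
    attended = cross_attention(tokens, other_tokens)
    return [[t_d + gate * a_d for t_d, a_d in zip(t, a)] for t, a in zip(tokens, attended)]


# Toy features: 2 video tokens and 3 text tokens, each of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

dual = backbone_layer(video, text, gate=0.5, fuse=False)   # fusion switched off
fused = backbone_layer(video, text, gate=0.5, fuse=True)   # fusion switched on
```

With fusion off, the video tokens are returned unchanged, which is the behaviour a retrieval-style setup with separate encoders would need. With fusion on, every video token receives a gated summary of the text tokens, and a gate of zero reduces the layer back to the unimodal path.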
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
