A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Pinelopi Papalampidi; Skanda Koppula; Shreya Pathak; Justin Chiu; Joe Heyward; Viorica Patraucean; Jiajun Shen; Antoine Miech; Andrew Zisserman; Aida Nematzadeh

arXiv:2312.07395·cs.CV·December 31, 2024·1 cites

Open Access

TL;DR

This paper introduces a simple, memory-efficient method for contrastively pre-training long, video-first encoders that outperform larger models on long-range video understanding benchmarks without architectural complexity.

Contribution

It demonstrates that large portions of videos can be masked during pre-training to effectively scale video encoders to longer durations, surpassing existing segment-based methods.

Findings

01. Masking up to 75% of the input video during contrastive pre-training improves scalability and performance.

02. Contrastive pre-training with masking outperforms larger LLM-based approaches.

03. The method scales to models with 1 billion parameters for long video understanding.

Abstract

Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways…
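To make the input-masking idea concrete, here is a minimal sketch of dropping 75% of a video's patch tokens before they reach the encoder. It is an illustration under stated assumptions, not the paper's implementation: the abstract shown here does not say how masked tokens are chosen, so uniform random sampling is assumed, and the function names (`sample_kept_tokens`, `attention_pairs`) and the clip size (32 frames of 14x14 patches) are hypothetical. The second helper assumes full joint space-time self-attention, whose cost grows with the square of the token count.

```python
import random


def sample_kept_tokens(num_frames, patches_per_frame, mask_ratio=0.75, seed=0):
    """Return sorted indices of the video tokens kept after random masking.

    Assumption: tokens are dropped uniformly at random over the whole clip.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    total = num_frames * patches_per_frame
    num_keep = max(1, round(total * (1.0 - mask_ratio)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), num_keep))


def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention over num_tokens."""
    return num_tokens * num_tokens


# Hypothetical clip: 32 frames, each split into 14 x 14 = 196 patches.
num_frames, patches_per_frame = 32, 196
full_tokens = num_frames * patches_per_frame
kept = sample_kept_tokens(num_frames, patches_per_frame, mask_ratio=0.75)

print(f"tokens before masking: {full_tokens}")
print(f"tokens after masking:  {len(kept)}")
print(f"attention pairs reduced by a factor of "
      f"{attention_pairs(full_tokens) // attention_pairs(len(kept))}")
```

Under these assumptions, keeping a quarter of the tokens cuts the number of attention pairs by a factor of 16, which is one way to see why masking can free enough memory to process more frames per clip.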

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques