A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzadeh

TL;DR
This paper introduces a simple, memory-efficient method for contrastively pre-training long, video-first encoders that outperform larger models on long-range video understanding benchmarks without architectural complexity.
Contribution
It demonstrates that large portions of videos can be masked during pre-training to effectively scale video encoders to longer durations, surpassing existing segment-based methods.
Findings
Masking up to 75% of each video during contrastive pre-training improves scalability and performance.
Contrastive pre-training with masking outperforms larger LLM-based approaches.
The method scales to models with 1 billion parameters for long video understanding.
Abstract
Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways…
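The input masking mentioned in the abstract can be illustrated with a minimal sketch. It assumes the video has already been split into patch tokens and that tokens are dropped uniformly at random per video; the abstract excerpt does not state the paper's exact masking scheme, so the function name `mask_video_tokens`, the uniform random sampling, and the tensor shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def mask_video_tokens(tokens, mask_ratio=0.75, rng=None):
    """Drop a random subset of patch tokens from each video (illustrative only).

    tokens:     array of shape (batch, num_tokens, dim)
    mask_ratio: fraction of tokens to drop (0.75 mirrors the "up to 75%" figure)
    Returns the kept tokens, shape (batch, num_keep, dim), and their indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, num_tokens, _ = tokens.shape
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))

    # Assumption: uniform random masking, sampled independently per video.
    noise = rng.random((batch, num_tokens))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]
    # Sort the kept indices so the surviving tokens stay in their original order.
    keep_idx = np.sort(keep_idx, axis=1)

    # Gather the kept tokens for every video in the batch.
    batch_idx = np.arange(batch)[:, None]
    kept = tokens[batch_idx, keep_idx]
    return kept, keep_idx

# Hypothetical sizes: 2 videos, 64 patch tokens each, 8-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 64, 8))
kept, keep_idx = mask_video_tokens(tokens, mask_ratio=0.75, rng=rng)
print(kept.shape)  # (2, 16, 8)
```

Only the kept tokens would be passed to the encoder, so the sequence length shrinks to a quarter at a 75% mask ratio. For standard full self-attention, whose cost grows quadratically with sequence length, that corresponds to roughly a 16x reduction in attention cost, which is one way to see why masking frees memory for more frames.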
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques
