Self-supervised video pretraining yields robust and more human-aligned visual representations
Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff

TL;DR
This paper introduces VITO, a self-supervised video pretraining method whose visual representations generalize across tasks, are robust to perturbations, and align more closely with human judgments, outperforming prior video pretraining methods on image understanding tasks.
Contribution
VITO is a novel contrastive framework for video pretraining that enhances generalization, robustness, and human alignment of visual representations.
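For readers unfamiliar with contrastive learning, the general idea can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch and not VITO's actual objective: the function names, the temperature value, and the choice to treat two frames of the same video as a positive pair are assumptions made here for illustration only.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    anchors[i] and positives[i] are embeddings of two views of the same
    source (hypothetically, two frames of one video); every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the correct class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another item in the batch, which is what pushes matching views together in embedding space.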
Findings
VITO outperforms prior video pretraining on image understanding tasks.
VITO representations are more robust to natural and synthetic deformations.
VITO's predictions align more closely with human judgments.
Abstract
Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Advanced Vision and Imaging
Methods: Contrastive Learning
