Self-supervised video pretraining yields robust and more human-aligned visual representations

Nikhil Parthasarathy; S. M. Ali Eslami; João Carreira; Olivier J. Hénaff

arXiv:2210.06433·cs.CV·January 13, 2025

Open Access

TL;DR

This paper introduces VITO, a self-supervised video pretraining method that produces versatile, robust, and human-aligned visual representations, outperforming existing models across various tasks and perturbations.

Contribution

VITO is a novel contrastive framework for video pretraining that enhances generalization, robustness, and human alignment of visual representations.

Findings

1. VITO outperforms prior video pretraining on image understanding tasks.

2. VITO representations are more robust to natural and synthetic deformations.

3. VITO's predictions align more closely with human judgments.

Abstract

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding…
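
The abstract describes a contrastive framework that learns from the transformations found in video, but it does not spell out the objective. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss in which two frames from the same clip are treated as a positive pair and frames from other clips in the batch act as negatives. The pairing scheme, the function names (`l2_normalize`, `info_nce`), and the temperature value are assumptions made for this example and are not taken from the paper; VITO's actual loss and curation procedure may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Assumption for this sketch: anchors[i] and positives[i] are embeddings of
    two frames drawn from the same video clip; every positives[j] with j != i
    serves as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching frame as the correct class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings: matched pairs should give a lower loss than mismatched pairs.
frames_t0 = [[1.0, 0.0], [0.0, 1.0]]
frames_t1_matched = [[1.0, 0.0], [0.0, 1.0]]
frames_t1_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(frames_t0, frames_t1_matched)
loss_swapped = info_nce(frames_t0, frames_t1_swapped)
```

In this toy setup the loss is small when each frame's embedding is closest to the embedding of its temporal partner and large when it is closer to a frame from another clip, which is the general pressure a contrastive objective applies.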

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Advanced Vision and Imaging

Methods: Contrastive Learning