Advancing Video Self-Supervised Learning via Image Foundation Models

Jingwei Wu; Zhewei Huang; Chang Liu

arXiv:2505.19218·cs.CV·May 27, 2025

TL;DR

This paper introduces AdViSe, a method that uses pre-trained image foundation models for efficient video self-supervised learning, cutting training time by 3.4× and GPU memory usage by 8.2× while achieving performance comparable to state-of-the-art methods on UCF101.

Contribution

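The paper adds temporal modeling modules (ResNet3D) to a pre-trained image foundation model and trains only those modules with playback rate perception while the image foundation model stays frozen, which keeps video self-supervised learning low-cost.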

Findings

01 Achieves performance comparable to state-of-the-art methods on UCF101.

02 Reduces training time by a factor of 3.4.

03 Lowers GPU memory usage by a factor of 8.2.

Abstract

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights…
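The abstract names playback rate perception as the pretext task but does not spell out how clips are drawn. As a rough illustration only, the sketch below shows one common way such a task can be set up: a clip is sampled from a video at a randomly chosen frame stride, and the index of that stride becomes the classification label. The rate set `PLAYBACK_RATES`, the clip length of 16, and the function name `sample_clip` are all assumptions made for this example; none of them come from the paper.

```python
import random

# Hypothetical playback rates (frame strides); the source does not list the ones used.
PLAYBACK_RATES = (1, 2, 4, 8)


def sample_clip(num_video_frames, clip_len=16, rates=PLAYBACK_RATES, rng=random):
    """Return (frame_indices, label) for one playback-rate training sample.

    frame_indices are positions in the source video, spaced by the chosen rate.
    label is the index of that rate in `rates`, i.e. the class to predict.
    """
    # Keep only the rates for which a full clip fits inside the video.
    feasible = [
        i for i, r in enumerate(rates)
        if (clip_len - 1) * r + 1 <= num_video_frames
    ]
    if not feasible:
        raise ValueError("video too short for any playback rate")

    label = rng.choice(feasible)
    rate = rates[label]

    # Number of source frames covered by the clip at this rate.
    span = (clip_len - 1) * rate + 1
    start = rng.randint(0, num_video_frames - span)  # inclusive on both ends

    indices = [start + k * rate for k in range(clip_len)]
    return indices, label


if __name__ == "__main__":
    indices, label = sample_clip(300, rng=random.Random(0))
    print("rate:", PLAYBACK_RATES[label], "frames:", indices)
```

In a setup like the one the abstract describes, the sampled frames would be passed through the frozen IFM and the trainable temporal modules, and a classifier on top would be trained to predict `label`; that wiring is omitted here because the source does not describe it.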

Code & Models

Repositories

jingwwu/advise-video-ssl (PyTorch, official)
