Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective

Zeen Song; Wenwen Qiang; Changwen Zheng; Hui Xiong; Gang Hua

arXiv:2407.14069 · cs.CV · February 9, 2026

TL;DR

This paper introduces BOD-VCL, a novel method for video contrastive learning that explicitly separates static and dynamic semantics using Koopman theory, leading to improved unsupervised video representations.

Contribution

The paper proposes a bi-level optimization approach that decouples static and dynamic features in videos, addressing the failure of existing contrastive learning methods to distinguish the two when computing sample similarity.

Findings

01. Significant performance improvements in action classification tasks.

02. Effective separation of static and dynamic semantics enhances representation quality.

03. The method integrates seamlessly with existing V-CL frameworks.

Abstract

Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. To realize these benefits, the learned representations must fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn either type of feature. Through a rigorous theoretical analysis based on the Structural Causal Model and gradient update, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics. This phenomenon creates spurious correlations between static and dynamic semantics in the dataset. Existing V-CL methods, however, do not differentiate static and dynamic similarities when computing sample similarity. As a result, learning only one…
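To make the point about undifferentiated similarity concrete, here is a minimal sketch of how a single clip-level similarity can hide a disagreement between static and dynamic content. It is not the paper's method: the paper separates the two with Koopman theory, whereas this toy example assumes a much cruder split (temporal mean as the "static" part, per-frame residuals from that mean as the "dynamic" part). The function names, the split, and the toy clips are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flatten(frames):
    """Concatenate per-frame feature vectors into one clip-level vector."""
    return [x for frame in frames for x in frame]


def split_static_dynamic(frames):
    """Assumed stand-in for a real decoupling step (NOT the Koopman-based one).

    static  = temporal mean of the frame features
    dynamic = concatenated residuals of each frame from that mean
    """
    num_frames = len(frames)
    dim = len(frames[0])
    static = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    dynamic = [f[i] - static[i] for f in frames for i in range(dim)]
    return static, dynamic


def decoupled_similarity(clip_a, clip_b):
    """Return (static similarity, dynamic similarity) for two clips."""
    static_a, dynamic_a = split_static_dynamic(clip_a)
    static_b, dynamic_b = split_static_dynamic(clip_b)
    return cosine(static_a, static_b), cosine(dynamic_a, dynamic_b)


# Toy clips with 3 frames of 2-D features: same average appearance,
# but the second feature moves in opposite directions over time.
clip_a = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
clip_b = [[1.0, 2.0], [1.0, 1.0], [1.0, 0.0]]

entangled = cosine(flatten(clip_a), flatten(clip_b))
static_sim, dynamic_sim = decoupled_similarity(clip_a, clip_b)

print(f"entangled: {entangled:.3f}")   # approximately 0.5
print(f"static:    {static_sim:.3f}")  # approximately 1.0
print(f"dynamic:   {dynamic_sim:.3f}") # approximately -1.0
```

In this toy case the single entangled score sits at roughly 0.5, while the split scores show that the clips agree fully on static content and disagree fully on dynamics. A contrastive objective driven only by the entangled score cannot tell these two situations apart, which is the gap the abstract describes.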

Taxonomy

Topics: Human Pose and Action Recognition

Methods: Contrastive Learning