Static and Dynamic Concepts for Self-supervised Video Representation Learning
Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin

TL;DR
This paper introduces a self-supervised video representation learning method that models static and dynamic concepts separately. It combines diversity and fidelity regularizations with cross-attention over local features, and reports state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
Contribution
The paper decouples static and dynamic concepts in videos by learning them from static frames and frame differences, respectively. It regularizes the concept set for diversity and fidelity, and uses cross-attention to aggregate local features for each concept.
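As a rough illustration of the two input views, the sketch below builds a static input from a single frame and a dynamic input from differences between consecutive frames. The function name `split_static_dynamic`, the choice of the middle frame, and the clip shape are assumptions made for illustration; this summary does not say how the paper samples frames.

```python
import numpy as np

def split_static_dynamic(clip):
    """Split a clip of shape (T, H, W, C) into a static and a dynamic view.

    Hypothetical helper: the middle-frame choice is an assumption,
    not a detail taken from the paper.
    """
    clip = np.asarray(clip, dtype=np.float32)
    if clip.ndim != 4 or clip.shape[0] < 2:
        raise ValueError("expected a clip of shape (T, H, W, C) with T >= 2")
    # Static view: one frame, which carries appearance but no motion.
    static = clip[clip.shape[0] // 2]
    # Dynamic view: differences between consecutive frames, which
    # suppress unchanging appearance and keep what moves.
    dynamic = clip[1:] - clip[:-1]
    return static, dynamic

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3))          # toy clip: 8 RGB frames of 16x16
static, dynamic = split_static_dynamic(clip)
```

A clip whose frames are all identical yields an all-zero dynamic view, which is why frame differences are a natural input for motion-related concepts.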
Findings
Achieves state-of-the-art accuracy on UCF-101, HMDB-51, and Diving-48 datasets.
Effectively disentangles static and dynamic concepts for better video representation.
Demonstrates the importance of local concept attention in video understanding.
Abstract
In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts then attend to discriminative local areas for video understanding. Specifically, we utilize static frame and frame difference to help decouple static and dynamic concepts, and respectively align the concept distributions in latent space. We add diversity and fidelity regularizations to guarantee that we learn a compact set of meaningful concepts. Then we employ a cross-attention mechanism to aggregate detailed local features of different concepts, and filter out redundant concepts with low activations to perform local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art…
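The aggregation and filtering step described in the abstract can be sketched as follows, under stated assumptions. Concept vectors act as queries that attend over local features, and only the most strongly activated concepts are kept. The function `aggregate_concepts`, the use of the maximum similarity score as the "activation", and the `keep` parameter are all hypothetical choices for illustration; the abstract does not define them.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_concepts(concepts, local_feats, keep=4):
    """Cross-attend concepts (K, D) over local features (N, D).

    Returns (kept, aggregated, attn): indices of the retained concepts,
    their aggregated features of shape (keep, D), and the full (K, N)
    attention map.
    """
    d = concepts.shape[1]
    scores = concepts @ local_feats.T / np.sqrt(d)   # (K, N) similarities
    attn = softmax(scores, axis=1)                   # each concept attends over locations
    aggregated = attn @ local_feats                  # (K, D) per-concept local summary
    # Assumed activation measure: a concept's strongest similarity to any location.
    activation = scores.max(axis=1)
    # Drop low-activation concepts and keep the top `keep`, in index order.
    kept = np.sort(np.argsort(activation)[-keep:])
    return kept, aggregated[kept], attn

rng = np.random.default_rng(0)
concepts = rng.standard_normal((8, 16))      # 8 toy concept vectors
local_feats = rng.standard_normal((20, 16))  # 20 toy local features
kept, agg, attn = aggregate_concepts(concepts, local_feats, keep=4)
```

The retained per-concept features in `agg` are the kind of input a local concept contrast would operate on; the contrastive loss itself is not shown here.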
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: ALIGN
