Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, and Yun-hui Liu

TL;DR
This paper introduces a self-supervised learning method for videos that uses spatio-temporal statistical summaries as a pretext task, improving performance on various video analysis benchmarks.
Contribution
It proposes a novel pretext task based on statistical summaries and spatial partitioning, enhancing self-supervised video representation learning.
Findings
Outperforms existing methods on multiple downstream tasks
Effective across various 3D backbone networks
Improves action recognition, retrieval, and scene understanding
Abstract
This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. A neural network is then built and trained to yield the statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual…
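To make the idea of a motion summary with a rough spatial location concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dense flow field has already been computed for the clip. The function names, the 4×4 grid, and the 8 direction bins are hypothetical choices made only for illustration.

```python
import numpy as np


def grid_block_index(y, x, height, width, rows=4, cols=4):
    """Map a pixel location to a 1-based block index in a rows x cols grid.

    The grid is an illustrative partitioning pattern (an assumption),
    standing in for exact Cartesian coordinates.
    """
    r = min(y * rows // height, rows - 1)
    c = min(x * cols // width, cols - 1)
    return r * cols + c + 1


def largest_motion_labels(flow, rows=4, cols=4, n_bins=8):
    """Return (block_index, direction_bin) for the largest motion in a clip.

    flow: array of shape (T, H, W, 2) holding (dx, dy) per pixel per frame,
          assumed to be precomputed by some optical-flow method.
    """
    # Sum the per-frame motion magnitude over time to get one (H, W) map.
    magnitude = np.linalg.norm(flow, axis=-1).sum(axis=0)
    y, x = np.unravel_index(np.argmax(magnitude), magnitude.shape)
    height, width = magnitude.shape
    block = grid_block_index(int(y), int(x), height, width, rows, cols)

    # Dominant direction: angle of the time-summed flow vector at that
    # pixel, quantized into n_bins equal angular bins (1-based).
    dx, dy = flow[:, y, x, :].sum(axis=0)
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    bin_width = 2 * np.pi / n_bins
    direction_bin = int(angle // bin_width) % n_bins + 1
    return block, direction_bin


# Toy clip: 4 frames of 8x8 flow with a single moving pixel.
toy_flow = np.zeros((4, 8, 8, 2))
toy_flow[:, 6, 1] = (1.0, 0.0)  # constant motion along +x at (y=6, x=1)
labels = largest_motion_labels(toy_flow)
```

In this sketch the pair of integers would serve as classification targets for a network that only sees the raw frames. Using a block index rather than pixel coordinates reflects the abstract's point that rough locations are easier to learn than exact ones.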
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Dense Connections · Batch Normalization · Average Pooling · Global Average Pooling · Residual Connection · (2+1)D Convolution · R(2+1)D
