Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Jiangliu Wang; Jianbo Jiao; Linchao Bao; Shengfeng He; Wei Liu; Yun-hui Liu

arXiv:2008.13426 · cs.CV · February 1, 2021 · 23 cites

Open Access

TL;DR

This paper introduces a self-supervised learning method for videos that uses spatio-temporal statistical summaries as a pretext task, improving performance on various video analysis benchmarks.

Contribution

It proposes a novel pretext task based on statistical summaries and spatial partitioning, enhancing self-supervised video representation learning.

Findings

01 Outperforms existing methods on multiple downstream tasks

02 Effective across various 3D backbone networks

03 Improves action recognition, retrieval, and scene understanding

Abstract

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To ease the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field, and only needs impressions about rough spatial locations to understand the visual…
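
To make the idea of encoding a rough spatial location concrete, the following is a minimal sketch and not the authors' implementation. It assumes grayscale frames, uses absolute frame differences as a simple stand-in for motion (the abstract does not say how motion is estimated), and uses a hypothetical 4x4 grid as the partitioning pattern. The function name `largest_motion_block` and its parameters are illustrative only.

```python
import numpy as np


def largest_motion_block(clip, grid=(4, 4)):
    """Return the row-major index of the grid block with the most motion.

    clip: array of shape (T, H, W) holding grayscale frames.
    grid: (rows, cols) of a hypothetical spatial partitioning pattern.
    """
    _, h, w = clip.shape
    rows, cols = grid
    # Assumed motion measure: absolute frame difference, summed over time.
    motion = np.abs(np.diff(clip, axis=0)).sum(axis=0)  # shape (H, W)
    # Crop so H and W divide evenly by the grid, then sum inside each block.
    bh, bw = h // rows, w // cols
    cropped = motion[: bh * rows, : bw * cols]
    blocks = cropped.reshape(rows, bh, cols, bw).sum(axis=(1, 3))
    # The flattened block index can serve as a classification target.
    return int(np.argmax(blocks))


# Synthetic clip: a bright square drifts right inside the bottom-right block.
clip = np.zeros((8, 64, 64), dtype=np.float32)
for i in range(8):
    clip[i, 50:58, 48 + i : 56 + i] = 1.0

label = largest_motion_block(clip)
print(label)  # 15, i.e. block (row 3, col 3) of the 4x4 grid
```

Predicting a block index turns location estimation into a classification problem over a small number of classes, which is one plausible reading of why rough locations are easier to learn than exact coordinates.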

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Dense Connections · Batch Normalization · Average Pooling · Global Average Pooling · Residual Connection · (2+1)D Convolution · R(2+1)D