Video Representation Learning by Dense Predictive Coding
Tengda Han, Weidi Xie, Andrew Zisserman

TL;DR
This paper introduces Dense Predictive Coding (DPC), a self-supervised framework that learns video representations by predicting future spatio-temporal features, and reports state-of-the-art self-supervised action recognition performance after fine-tuning.
Contribution
The paper introduces the DPC framework, proposes a curriculum training scheme that predicts further into the future with progressively less temporal context, and demonstrates strong results on action recognition benchmarks.
Findings
DPC achieves state-of-the-art self-supervised results on UCF101 and HMDB51.
Pretraining with DPC approaches ImageNet pretraining performance.
The curriculum scheme improves the semantic quality of learned representations.
Abstract
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to only encode slowly varying spatial-temporal signals, therefore leading to semantic representations; Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With single stream (RGB only), DPC pretrained…
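The two ideas in the abstract, recurrent prediction of future block embeddings and a curriculum that shrinks the context while extending the prediction horizon, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: each video block is reduced to a single vector (DPC itself predicts a dense feature map per block), a plain tanh recurrence stands in for the aggregator, the contrastive (InfoNCE-style) loss is an assumed training objective, and the names `predict_future`, `info_nce_loss`, and the `CURRICULUM` values are hypothetical.

```python
import numpy as np

def info_nce_loss(pred, target):
    """Contrastive loss: row i of pred should match row i of target.
    All other rows of target act as negatives. Shapes: (N, D)."""
    logits = pred @ target.T                               # (N, N) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def predict_future(context, n_pred, W_agg, W_pred):
    """Aggregate context block embeddings, then roll out n_pred predictions.
    context: (T, D). A toy tanh RNN stands in for the real aggregator."""
    h = np.zeros(W_agg.shape[0])
    for z in context:                                      # summarize the observed blocks
        h = np.tanh(W_agg @ np.concatenate([h, z]))
    preds = []
    for _ in range(n_pred):                                # predict recurrently
        z_hat = W_pred @ h
        preds.append(z_hat)
        h = np.tanh(W_agg @ np.concatenate([h, z_hat]))    # feed the prediction back in
    return np.stack(preds)

rng = np.random.default_rng(0)
D, H, T = 8, 16, 8                                         # hypothetical sizes
W_agg = rng.normal(scale=0.1, size=(H, H + D))
W_pred = rng.normal(scale=0.1, size=(D, H))
blocks = rng.normal(size=(T, D))                           # stand-in for encoded video blocks

# Hypothetical curriculum: (context blocks, predicted blocks) per stage.
# Later stages see less context and must predict further ahead.
CURRICULUM = [(5, 3), (4, 4)]
losses = []
for n_ctx, n_pred in CURRICULUM:
    preds = predict_future(blocks[:n_ctx], n_pred, W_agg, W_pred)
    targets = blocks[n_ctx:n_ctx + n_pred]
    losses.append(info_nce_loss(preds, targets))
```

In a real training loop the weights would be updated by gradient descent on the loss at each stage; here the weights are random, so the loss values only show that the pipeline runs end to end.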
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Anomaly Detection Techniques and Applications
Methods: Average Pooling · Residual Connection · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling
