Memory-augmented Dense Predictive Coding for Video Representation Learning
Tengda Han, Weidi Xie, Andrew Zisserman

TL;DR
This paper introduces MemDPC, a novel self-supervised learning framework for video representations that leverages memory-augmented predictive coding to improve action recognition and related tasks with less data.
Contribution
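The paper proposes MemDPC, an architecture and learning framework trained with a predictive attention mechanism over a set of compressed memories. Future states are predicted as convex combinations of the memory entries, which lets the model form multiple hypotheses efficiently.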
Findings
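Achieves state-of-the-art or comparable performance on four downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification.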
Requires significantly less training data than previous methods.
Learns visual-only representations from RGB frames, from unsupervised optical flow, or from both.
Abstract
The objective of this paper is self-supervised learning from video, in particular of representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over a set of compressed memories, such that any future state can always be constructed as a convex combination of the condensed representations, allowing the model to make multiple hypotheses efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all…
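The "convex combination of the condensed representations" in contribution (i) can be illustrated with a small sketch. This is not the authors' implementation: the function name `predict_from_memory`, the linear projection `query_proj`, and all sizes are hypothetical, chosen only to show how softmax attention weights over a memory bank give a prediction that is a convex combination of the memory entries.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax: outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_from_memory(context, query_proj, memory):
    """Predict a future feature as a convex combination of memory slots.

    context    : list of d_c floats, a summary of the observed past (assumed)
    query_proj : k rows of d_c floats, a hypothetical linear map that yields
                 one attention logit per memory slot
    memory     : k rows of d_f floats, the compressed memory bank
    """
    # One logit per memory slot.
    logits = [sum(c * w for c, w in zip(context, row)) for row in query_proj]
    # Softmax turns the logits into convex-combination weights.
    weights = softmax(logits)
    # Weighted sum of the memory slots, one output coordinate at a time.
    dim = len(memory[0])
    prediction = [
        sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(dim)
    ]
    return prediction, weights


# Toy data with made-up sizes: 8 memory slots, 4-dim features, 16-dim context.
rng = random.Random(0)
memory = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
context = [rng.gauss(0, 1) for _ in range(16)]
query_proj = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

prediction, weights = predict_from_memory(context, query_proj, memory)
```

Because the weights are non-negative and sum to one, every predicted coordinate lies between the smallest and largest value that coordinate takes across the memory slots. Different contexts select different mixtures of the same slots, which is one way to read the abstract's remark about making multiple hypotheses efficiently.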
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
