Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition
Jinpeng Wang, Yiqi Lin, Andy J. Ma

TL;DR
This paper introduces a novel self-supervised learning approach for action recognition that leverages spatio-temporal consistency regularization and specialized data augmentations, significantly improving performance over existing methods.
Contribution
It proposes a new consistency regularization framework using high-level feature maps and develops two video-specific data augmentation techniques for better action feature extraction.
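To make the idea of consistency regularization on feature maps concrete, here is a minimal sketch. It is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a real video backbone, temporal reversal is just one assumed example of a spatio-temporal transformation, and the mean-squared-error form of the loss is an assumption, since this summary does not state the exact objective.

```python
import numpy as np

def toy_encoder(video):
    """Hypothetical stand-in for a video backbone (assumption, not the paper's network).

    Takes a video of shape (T, H, W, C) and average-pools each frame over
    2x2 spatial blocks, giving a feature map of shape (T, H/2, W/2, C).
    """
    t, h, w, c = video.shape
    return video.reshape(t, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def temporal_flip(x):
    """Assumed example transformation: reverse the time axis (axis 0)."""
    return x[::-1]

def consistency_loss(encoder, video, transform):
    """Mean squared difference between the two siamese paths (assumed loss form).

    The clean path encodes the original video; the noise path encodes the
    augmented video. The clean features are transformed the same way so the
    two feature maps can be compared position by position.
    """
    clean = encoder(video)             # clean path
    noisy = encoder(transform(video))  # noise path
    return float(np.mean((noisy - transform(clean)) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((4, 8, 8, 3))   # 4 frames of 8x8 RGB
    print(consistency_loss(toy_encoder, video, temporal_flip))
```

Because the toy encoder treats every frame independently, it already commutes with temporal reversal and the loss is essentially zero. A learned network has no such guarantee, which is why a penalty of this kind can act as a training signal without labels.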
Findings
Achieves 22% relative improvement on HMDB51
Achieves 7% relative improvement on UCF101
Outperforms state-of-the-art self-supervised methods
Abstract
Self-supervised learning has shown great potential in improving deep learning models in an unsupervised manner by constructing surrogate supervision signals directly from the unlabeled data. Different from existing works, we present a novel way to obtain the surrogate supervision signal based on high-level feature maps under consistency regularization. In this paper, we propose a Spatio-Temporal Consistency Regularization between different output features generated from a siamese network including a clean path fed with the original video and a noise path fed with the corresponding augmented video. Based on the Spatio-Temporal characteristics of video, we develop two video-based data augmentation methods, i.e., Spatio-Temporal Transformation and Intra-Video Mixup. Consistency of the former one is proposed to model transformation consistency of features, while the latter one aims at…
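The abstract names Intra-Video Mixup but is cut off before describing it. The sketch below is therefore only an assumed reading of the name: a standard mixup-style convex combination in which every frame is blended with a single frame sampled from the same video. The function name `intra_video_mixup`, the mixing weight `lam`, and the sampling scheme are all hypothetical and may differ from the paper's formulation.

```python
import numpy as np

def intra_video_mixup(video, lam=0.8, rng=None):
    """Blend each frame with one frame drawn from the same video (assumed scheme).

    video: array of shape (T, H, W, C)
    lam:   weight kept on the original frames; (1 - lam) goes to the sampled frame
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, video.shape[0])   # pick one frame index at random
    static = video[idx]                     # shape (H, W, C), broadcast over time
    return lam * video + (1.0 - lam) * static

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((4, 8, 8, 3))
    mixed = intra_video_mixup(clip, lam=0.8, rng=rng)
    print(mixed.shape)
```

Under this assumed scheme the same static frame is added to every time step, so frame-to-frame differences are scaled by `lam` but otherwise unchanged, while the appearance of each individual frame is perturbed.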
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Hand Gesture Recognition Systems
Methods: Mixup · Siamese Network
