Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation
Amirhossein Dadashzadeh, Alan Whone, Majid Mirmehdi

TL;DR
This paper introduces auxSKD, a knowledge distillation-based auxiliary pretraining method for self-supervised video representation learning that improves generalization on smaller datasets and across domain differences, together with a new pretext task called VSPP.
Contribution
The paper proposes auxSKD, a novel auxiliary pretraining approach built on similarity-based knowledge distillation, and introduces VSPP, a new pretext task for learning better video representations.
Findings
auxSKD outperforms state-of-the-art methods on the UCF101 and HMDB51 datasets.
Adding auxSKD improves existing self-supervised methods like VCOP, VideoPace, and RSPNet.
The method enhances generalization on smaller and domain-shifted datasets.
Abstract
Despite the outstanding success of self-supervised pretraining methods for video representation learning, they generalise poorly when the unlabelled dataset for pretraining is small or the domain difference between unlabelled data in the source task (pretraining) and labelled data in the target task (finetuning) is significant. To mitigate these issues, we propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400. Our method deploys a teacher network that iteratively distills its knowledge to the student model by capturing the similarity information between segments of unlabelled video data. The student model meanwhile solves a pretext task by exploiting this prior knowledge. We also…
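The teacher-to-student transfer described in the abstract can be illustrated with a generic similarity-distillation loss. The sketch below is not the authors' implementation: the abstract does not state the loss, so the anchor bank, the temperature value, the cosine similarity, and the KL objective are all assumptions borrowed from the usual similarity-distillation recipe, and every function name is hypothetical. It only shows the idea of matching the student's similarity distribution over a set of video-segment embeddings to the teacher's.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits, axis=-1):
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def similarity_distribution(queries, anchors, temperature):
    # Probability distribution over anchors for each query segment embedding.
    q = l2_normalize(queries)
    a = l2_normalize(anchors)
    return softmax(q @ a.T / temperature)


def similarity_distillation_loss(student_emb, teacher_emb, anchors, temperature=0.1):
    # Assumption: KL(teacher || student) between similarity distributions,
    # averaged over the batch. The temperature of 0.1 is illustrative only.
    p_teacher = similarity_distribution(teacher_emb, anchors, temperature)
    p_student = similarity_distribution(student_emb, anchors, temperature)
    eps = 1e-12
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1
    )
    return float(kl.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(16, 8))   # hypothetical bank of segment embeddings
    teacher = rng.normal(size=(4, 8))    # teacher embeddings for 4 query segments
    student = rng.normal(size=(4, 8))    # student embeddings for the same segments
    print(similarity_distillation_loss(student, teacher, anchors))
```

In a training loop the teacher embeddings would be treated as fixed targets and only the student would receive gradients; how the teacher is updated between iterations is not specified in the text above.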
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
