Motion-Augmented Self-Training for Video Recognition at Smaller Scale
Kirill Gavrilyuk, Mihir Jain, Ilia Karmanov, Cees G. M. Snoek

TL;DR
This paper introduces MotionFit, a motion-augmented self-training method for video recognition. It uses optical flow and pseudo-labeling during training to improve performance on small-scale video datasets, and it requires no motion computation at inference.
Contribution
The paper presents the first motion-augmented self-training regime, MotionFit, which enhances small-scale video recognition by combining motion models, pseudo-labeling, and temporal granularity considerations.
Findings
MotionFit outperforms existing methods by 5-8% in knowledge transfer.
It surpasses video-only self-supervision by 1-7%.
It improves semi-supervised learning results by 9-18%.
Abstract
The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal…
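The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a motion model that outputs per-class scores for each unlabeled video, takes the highest-scoring class as a hard pseudo-label (an assumption, since the abstract does not say how pseudo-labels are derived), and scores an appearance model against those labels with a standard cross-entropy loss. The multi-clip loss is not covered. All function names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(motion_logits):
    """Assumed rule: hard pseudo-label = class the motion model scores highest."""
    return max(range(len(motion_logits)), key=lambda i: motion_logits[i])


def cross_entropy(appearance_logits, label):
    """Loss of the appearance model's prediction against one pseudo-label."""
    return -math.log(softmax(appearance_logits)[label])


# Hypothetical scores for two unlabeled videos over three classes.
# The motion model sees optical flow; the appearance model sees RGB frames.
motion_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]]
appearance_logits = [[1.0, 0.5, 0.0], [0.0, 0.0, 2.0]]

# Step 1: the motion model labels the unlabeled collection.
labels = [pseudo_label(m) for m in motion_logits]

# Step 2: the appearance model is trained to predict those pseudo-labels.
losses = [cross_entropy(a, y) for a, y in zip(appearance_logits, labels)]
mean_loss = sum(losses) / len(losses)

print("pseudo-labels:", labels)
print("mean loss:", round(mean_loss, 4))
```

In this sketch only the appearance model would be kept for deployment, which matches the abstract's aim of avoiding optical flow computation during inference.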
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
