TL;DR
This paper introduces an unsupervised method for learning video representations that combines clustering in IDT (improved dense trajectories) descriptor space with local aggregation, improving action recognition performance and yielding representations that better capture motion.
Contribution
The paper adapts discrimination-based objectives from image representation learning (instance recognition and local aggregation) to videos, and uses IDT descriptors as a motion prior for unsupervised video representation learning.
Findings
Outperforms prior methods on UCF101 and HMDB51 benchmarks.
Successfully captures video motion dynamics.
Clustering in IDT space improves motion-related representation quality.
Abstract
This paper addresses the task of unsupervised learning of representations for action recognition in videos. Previous works proposed to utilize future prediction or other domain-specific objectives to train a network, but achieved only limited success. In contrast, in the related field of image representation learning, simpler, discrimination-based methods have recently bridged the gap to fully-supervised performance. We first propose to adapt two top-performing objectives in this class, instance recognition and local aggregation, to the video domain. In particular, the latter approach iterates between clustering the videos in the feature space of a network and updating it to respect the cluster with a non-parametric classification loss. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns, grouping the…
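The alternation described in the abstract (cluster the videos in feature space, then update the network so that each video is classified with its cluster) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the plain k-means step, the temperature `tau`, and the cluster-level softmax over a memory bank are all assumptions made for illustration, and the published local aggregation objective is more involved than this simplified variant. No gradients are computed here; in practice the embeddings would come from a video network and the loss would be backpropagated.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

def cluster_nonparametric_loss(emb, bank, labels, tau=0.07):
    """Negative log-probability that a video is matched to a member of its
    own cluster, using a softmax over similarities to a memory bank
    (hypothetical simplification of a non-parametric classification loss)."""
    logits = emb @ bank.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    same_cluster = labels[:, None] == labels[None, :]
    p_cluster = (p * same_cluster).sum(axis=1)
    return float(-np.log(p_cluster).mean())

# Toy stand-in for video embeddings: two tight groups on the unit sphere.
rng = np.random.default_rng(0)
centers_true = np.eye(8)[:2]
feats = l2_normalize(
    np.repeat(centers_true, 20, axis=0) + 0.05 * rng.normal(size=(40, 8))
)

labels = kmeans(feats, k=2)                              # step 1: cluster
loss = cluster_nonparametric_loss(feats, feats, labels)  # step 2: loss to minimise
```

In a training loop these two steps would repeat: the network is updated to reduce the loss, the memory bank is refreshed with the new embeddings, and the clustering is recomputed. The same loss could be applied to cluster labels computed in a different descriptor space instead of the network's own features.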
