TL;DR
This paper introduces two Siamese CNN architectures with novel loss functions for unsupervised learning from videos. The losses capture temporal coherence between contiguous frames and separate the representations of videos with different content, to improve action and scene recognition.
Contribution
The authors propose two Siamese CNN architectures with new loss functions that jointly leverage local temporal coherence and global discriminative margins for unsupervised video learning.
Findings
Learned features can discover actions and scenes in videos.
Unsupervised features outperform traditional supervised pre-training in recognition tasks.
Abstract
In this work we address the challenging problem of unsupervised learning from videos. Existing methods utilize the spatio-temporal continuity in contiguous video frames as regularization for the learning process. Typically, this temporal coherence of close frames is used as a free form of annotation, encouraging the learned representations to exhibit small differences between these frames. But this type of approach fails to capture the dissimilarity between videos with different content, hence learning less discriminative features. We here propose two Siamese architectures for Convolutional Neural Networks, and their corresponding novel loss functions, to learn from unlabeled videos, which jointly exploit the local temporal coherence between contiguous frames, and a global discriminative margin used to separate representations of different videos. An extensive experimental evaluation is…
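To make the idea of a local coherence term combined with a global margin concrete, here is a minimal sketch of a generic contrastive-style Siamese loss in NumPy. It is not the paper's loss: the abstract does not give the formulas, so the function name `siamese_margin_loss`, the squared Euclidean distance, the hinge form, and the default `margin=1.0` are all assumptions chosen for illustration.

```python
import numpy as np


def siamese_margin_loss(feat_a, feat_b, same_video, margin=1.0):
    """Generic contrastive-style loss over pairs of frame embeddings.

    Hypothetical illustration only; not the loss defined in the paper.

    feat_a, feat_b : (n, d) arrays holding the two embeddings of each pair,
                     as produced by the two weight-sharing Siamese branches.
    same_video     : (n,) boolean array; True when the pair consists of
                     contiguous frames of one video, False when the frames
                     come from different videos.
    margin         : assumed minimum distance between different videos.
    """
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    same_video = np.asarray(same_video, dtype=bool)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(feat_a - feat_b, axis=1)

    # Local temporal coherence: contiguous frames should be close.
    coherence = dist ** 2

    # Global discriminative margin: frames from different videos are
    # penalized only while they are closer than `margin`.
    separation = np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(np.where(same_video, coherence, separation)))
```

Under these assumptions, a pair of contiguous frames contributes its squared distance, while a pair from different videos contributes nothing once its embeddings are at least `margin` apart. The second term is what a coherence-only objective lacks.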