Incorporating Scalability in Unsupervised Spatio-Temporal Feature Learning
Sujoy Paul, Sourya Roy, Amit K. Roy-Chowdhury

TL;DR
This paper introduces a simple yet effective unsupervised framework for learning spatio-temporal features from videos using a Convolutional 3D Siamese network, reducing reliance on labeled data.
Contribution
It presents a novel unsupervised learning approach with a Siamese network for spatio-temporal feature embedding from unlabeled videos.
Findings
Effective feature learning across multiple datasets
Applicable to various computer vision tasks
Reduces need for labeled video data
Abstract
Deep neural networks are efficient learning machines that rely on large amounts of manually labeled data to learn discriminative features. However, acquiring a substantial amount of labeled data, especially for videos, can be tedious across various computer vision tasks. This necessitates learning visual features from videos in an unsupervised setting. In this paper, we propose a computationally simple, yet effective, framework to learn spatio-temporal feature embeddings from unlabeled videos. We train a Convolutional 3D Siamese network using positive and negative pairs mined from videos under certain probabilistic assumptions. Experimental results on three datasets demonstrate that our proposed framework learns weights that can be used on the same dataset as well as across datasets and tasks.
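The abstract does not spell out the loss function or the pair-mining rule, so the following is only a minimal sketch of how a Siamese setup with positive and negative pairs is commonly trained. It assumes the standard margin-based contrastive loss and a simple hypothetical mining heuristic (two clips from the same video form a positive pair, clips from different videos form a negative pair). The names `contrastive_loss`, `mine_pairs`, and the `margin` parameter are illustrative and are not taken from the paper.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, is_positive, margin=1.0):
    """Margin-based contrastive loss over a batch of embedding pairs.

    Assumption: this is the generic Siamese loss, not necessarily the
    one used by the authors.
    """
    # Euclidean distance between the two branch outputs of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    # Positive pairs are pulled together.
    pos_term = is_positive * dist ** 2
    # Negative pairs are pushed apart until they exceed the margin.
    neg_term = (1 - is_positive) * np.maximum(margin - dist, 0.0) ** 2
    return 0.5 * np.mean(pos_term + neg_term)


def mine_pairs(num_videos, clips_per_video, rng):
    """Hypothetical mining rule: same video -> positive, other video -> negative.

    Returns two pairs of (video_index, clip_index) tuples.
    Requires num_videos >= 2 and clips_per_video >= 2.
    """
    v = rng.integers(num_videos)
    c1, c2 = rng.choice(clips_per_video, size=2, replace=False)
    positive = ((v, c1), (v, c2))
    # Pick a different video by adding a non-zero offset modulo num_videos.
    u = (v + 1 + rng.integers(num_videos - 1)) % num_videos
    negative = ((v, c1), (u, rng.integers(clips_per_video)))
    return positive, negative
```

In a real pipeline the embeddings would come from the two weight-sharing branches of a 3D convolutional network applied to the mined clips; here they are plain arrays so the loss can be inspected in isolation.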
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
Methods: Siamese Network
