Unsupervised Learning of Visual Representations using Videos
Xiaolong Wang, Abhinav Gupta

TL;DR
This paper introduces an unsupervised method for training CNNs on unlabeled videos, using visual tracking as the supervisory signal. The resulting models perform competitively with supervised ones without using any ImageNet images.
Contribution
The paper presents an unsupervised learning approach for CNNs that uses visual tracking in videos as supervision, removing the need for large labeled pretraining datasets such as ImageNet.
Findings
An ensemble of unsupervised networks achieves 52% mAP on object detection using only 100K unlabeled videos and the VOC 2012 dataset, with no ImageNet images
Performance comes close to that of supervised models trained on ImageNet
The learned representation is also effective in other tasks, such as surface-normal estimation
Abstract
Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNN. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision. That is, two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no…
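The ranking idea in the abstract can be illustrated with a small sketch. It assumes cosine distance between feature vectors and a margin of 0.5; both are illustrative choices that the summary above does not specify, and the function names are hypothetical. In a real Siamese-triplet network the three vectors would be CNN features of a tracked patch, the patch at the other end of its track, and a patch from elsewhere; here they are plain Python lists.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style ranking loss over one triplet of feature vectors.

    The loss is zero once the anchor is closer to the tracked (positive)
    patch than to the unrelated (negative) patch by at least `margin`.
    The margin value is an assumption made for this sketch.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy features: the tracked pair points the same way, the negative is orthogonal.
first_patch = [1.0, 0.0]
tracked_patch = [1.0, 0.0]
random_patch = [0.0, 1.0]

well_ranked = triplet_ranking_loss(first_patch, tracked_patch, random_patch)
badly_ranked = triplet_ranking_loss(first_patch, random_patch, tracked_patch)
```

With these toy vectors the correctly ordered triplet incurs no loss, while swapping the positive and negative patches is penalized, which is the behavior a ranking loss is meant to enforce during training.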
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Vision and Imaging
