Unsupervised Learning of Visual Representations using Videos

Xiaolong Wang; Abhinav Gupta

arXiv:1505.00687 · cs.CV · October 7, 2015 · 202 cites

TL;DR

This paper introduces an unsupervised method for training CNNs on unlabeled videos, using visual tracking as the supervisory signal. The resulting models perform competitively with ImageNet-supervised models without using any ImageNet images.

Contribution

The paper presents a novel unsupervised learning approach for CNNs that uses visual tracking in unlabeled videos as supervision, removing the need for ImageNet-scale labeled pretraining data.

Findings

01

Achieves 52% mAP on object detection without using any ImageNet images (100K unlabeled videos plus the VOC 2012 dataset)

02

Performance close to that of supervised models trained on ImageNet

03

Effective in tasks like surface-normal estimation

Abstract

Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNN. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision. That is, two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no…
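The ranking idea in the abstract can be illustrated with a minimal sketch. It assumes cosine distance between feature vectors and a hinge-style ranking loss with a margin; the function names and the margin value of 0.5 are illustrative assumptions, not details confirmed by the text above. The anchor and positive stand for features of two patches connected by a track, and the negative for the feature of an unrelated patch.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style ranking loss over one triplet of feature vectors.

    The loss is zero once the tracked pair (anchor, positive) is closer
    than the unrelated pair (anchor, negative) by at least `margin`.
    The margin value here is an assumed placeholder.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In a Siamese-triplet setup, the three feature vectors would come from the same network with shared weights, so minimizing this loss pulls tracked patches together in feature space and pushes unrelated patches apart.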

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Vision and Imaging