Unsupervised Pre-Training of Image Features on Non-Curated Data

Mathilde Caron; Piotr Bojanowski; Julien Mairal; Armand Joulin

arXiv:1905.01278·cs.CV·August 14, 2019


TL;DR

This paper introduces a new unsupervised pre-training method for image features using large-scale uncurated data, achieving state-of-the-art results and improving supervised classification accuracy.

Contribution

The paper presents a novel self-supervised clustering approach that effectively leverages massive uncurated datasets for visual feature learning.

Findings

01. Achieved state-of-the-art results among unsupervised methods on standard benchmarks.

02. Pre-training with the proposed method improves supervised ImageNet classification accuracy.

03. Validated the effectiveness of unsupervised learning on 96 million uncurated images.

Abstract

Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using uncurated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available. To that effect, we propose a new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data. We validate our approach on 96 million images from YFCC100M, achieving state-of-the-art results among unsupervised methods on standard benchmarks, which confirms the potential…


Equations

$\frac{1}{N}\sum_{n=1}^{N}\left[\ell\big(Vf_{\theta}(x_{n}),y_{n}\big)+\sum_{s=1}^{S}y_{ns}\,\ell\left(W_{s}f_{\theta}(x_{n}),z^{s}_{n}\right)\right],$


Full text

**Leveraging Large-Scale Uncurated Data for Unsupervised Learning of Visual Features**

**Mathilde Caron, Piotr Bojanowski, Armand Joulin and Julien Mairal**

• Goal: learning general-purpose visual features with convnets on large-scale unsupervised and uncurated datasets.

• Motivation:

  – bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available;

  – new unsupervised approach which leverages self-supervision and clustering to capture complementary statistics from large-scale data.

• Method: our approach iterates between:

  – hierarchical clustering of the features;

  – updating convnet weights by predicting both rotation angle and cluster assignment in a single hierarchical loss.

• Results: features pre-trained on $95$M images from YFCC100M achieve state-of-the-art performance on standard evaluation benchmarks with VGG-$16$.
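The clustering step of the Method item above can be illustrated on toy data. The sketch below is not the authors' implementation: it assumes plain k-means at both levels with a deterministic farthest-point initialization, and the names `kmeans` and `two_level_targets` are hypothetical. The rotation prediction and the convnet update are left out.

```python
def _sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means; returns one cluster index per point.

    Assumption: centroids are initialised with points[0] and then
    the point farthest from the centroids chosen so far.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(_sqdist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid.
        labels = [
            min(range(k), key=lambda c: _sqdist(p, centroids[c])) for p in points
        ]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def two_level_targets(features, n_super, n_sub):
    """Cluster into super-classes, then cluster each super-class into sub-classes.

    Returns a list of (super_class, sub_class) pairs, one per feature vector.
    """
    super_labels = kmeans(features, n_super)
    sub_labels = [0] * len(features)
    for s in range(n_super):
        idx = [i for i, l in enumerate(super_labels) if l == s]
        if not idx:
            continue
        local = kmeans([features[i] for i in idx], min(n_sub, len(idx)))
        for i, l in zip(idx, local):
            sub_labels[i] = l
    return list(zip(super_labels, sub_labels))


# Toy 2-D "features": two far-apart groups, each with two sub-groups.
feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 2.0), (0.1, 2.0),
         (10.0, 0.0), (10.1, 0.0), (10.0, 2.0), (10.1, 2.0)]
targets = two_level_targets(feats, n_super=2, n_sub=2)
```

In a full training loop these pairs would be recomputed from the current convnet features before each round of weight updates.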

**Overview**

**Illustration of our approach**

• A large set of unlabelled images $\{x_{1},\ldots,x_{N}\}$, $x_{i}$ in $\mathbb{R}^{3\times 224\times 224}$.

• $f_{\theta}$ is the convnet mapping (with $\theta$ the set of corresponding parameters).

• We partition the target labels into a $2$-level hierarchy:

  – Super-classes: $y_{n}$ the super-class assignment vector in $\{0,1\}^{S}$ of the image $n$;

  – Sub-classes: partitioning within each super-class. $z^{s}_{n}$ is the vector in $\{0,1\}^{k_{s}}$ of the assignment into $k_{s}$ sub-classes for an image $n$ belonging to super-class $s$.

• Parameters of linear classifiers $(V,W_{1},\dots,W_{S})$ and $\theta$ are learned by minimizing:

$\frac{1}{N}\sum_{n=1}^{N}\left[\ell\big(Vf_{\theta}(x_{n}),y_{n}\big)+\sum_{s=1}^{S}y_{ns}\,\ell\left(W_{s}f_{\theta}(x_{n}),z^{s}_{n}\right)\right],$

where $\ell$ is the negative log-softmax function.
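The objective above can be evaluated directly once the features $f_{\theta}(x_{n})$ are given. The sketch below is a plain-Python illustration under stated assumptions: features are precomputed lists of floats, labels are stored as class indices rather than one-hot vectors, and the helper names (`neg_log_softmax`, `matvec`, `hierarchical_loss`) are hypothetical. Because $y_{n}$ is one-hot, the inner sum over $s$ reduces to the single sub-classifier $W_{s}$ of the image's own super-class.

```python
import math


def neg_log_softmax(scores, target):
    """Negative log-softmax of `scores` at class index `target` (log-sum-exp form)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]


def hierarchical_loss(features, super_labels, sub_labels, V, W):
    """Mean over images of l(V f, y_n) + l(W_s f, z_n^s), with s the image's super-class.

    features:     list of feature vectors f_theta(x_n)
    super_labels: list of super-class indices (argmax of y_n)
    sub_labels:   list of sub-class indices within that super-class
    V:            S x d matrix; W: list of S matrices, W[s] of size k_s x d
    """
    total = 0.0
    for f, s, z in zip(features, super_labels, sub_labels):
        total += neg_log_softmax(matvec(V, f), s)
        # Only the sub-classifier of super-class s contributes (y_ns = 1).
        total += neg_log_softmax(matvec(W[s], f), z)
    return total / len(features)


# With all-zero classifiers, each softmax is uniform, so the loss is
# log(S) + log(k_s); here S = 2 and k_s = 3 for both super-classes.
feats = [[1.0, 2.0], [0.5, -1.0]]
V0 = [[0.0, 0.0], [0.0, 0.0]]
W0 = [[[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3]
loss = hierarchical_loss(feats, [0, 1], [2, 0], V0, W0)
```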

**Method**

| Method | Data | Classif. fc68 | Classif. all | Detect. fc68 | Detect. all |
|---|---|---|---|---|---|
| ImageNet labels | ImageNet | $89.3$ | $86.9$ | $57.0$ | $67.3$ |
| Unsupervised on curated data | | | | | |
| Larsson et al. [larsson2017colorization] | ImageNet + Places | – | $77.2$ | $45.6$ | $59.7$ |
| Doersch et al. [doersch2015unsupervised] | ImageNet | $54.6$ | $78.5$ | $38.0$ | $62.7$ |
| Caron et al. [caron2018deep] | ImageNet | $78.5$ | $82.3$ | $\mathbf{57.1}$ | $65.9$ |
| Unsupervised on uncurated data | | | | | |
| Mahendran et al. [mahendran2018cross] | YFCC100M videos | – | $76.4$ | – | – |
| Wang and Gupta [wang2015unsupervised] | Youtube8M | – | – | – | $60.2$ |
| Wang et al. [wang2017transitive] | Youtube9M | $59.4$ | $79.6$ | $40.9$ | $63.2$ |
| Our method | YFCC100M | $\mathbf{79.9}$ | $\mathbf{83.8}$ | $56.9$ | $\mathbf{67.5}$ |

**Transfer learning to Pascal VOC 2007**

We train logistic regressions on top of frozen convolutional layers at different depths.
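As an illustration of this evaluation protocol, the following sketch fits a binary logistic regression on fixed feature vectors. It is a toy example rather than the authors' setup: the optimizer (batch gradient descent), the learning rate, the step count, and the names `train_logreg` and `predict` are all assumptions. "Frozen" here means the features are inputs only; no gradient reaches whatever produced them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(features, labels, lr=0.5, steps=200):
    """Fit weights w and bias b by batch gradient descent on the logistic loss.

    `features` stand in for activations of a frozen layer and are never modified.
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Prediction error drives the gradient of the logistic loss.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """True if the positive class has probability at least 0.5."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


# Toy 1-D frozen features, linearly separable around zero.
X = [[-2.0], [-1.0], [1.0], [2.0]]
Y = [0, 0, 1, 1]
w, b = train_logreg(X, Y)
preds = [predict(w, b, x) for x in X]
```

For a multi-label task such as Pascal VOC classification, one such classifier per class would be trained on the same frozen features, once per layer depth being evaluated.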

**Comparing with methods on YFCC100M**

We report validation mAP on Pascal VOC classification task (fc68 setting).
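For readers unfamiliar with the metric, mAP is the mean over classes of the average precision (AP) of each class's ranked predictions. The sketch below computes the non-interpolated form of AP; this is an assumption, since the text does not say which variant is used, and the official Pascal VOC 2007 protocol relies on an 11-point interpolated AP that can give slightly different values. The function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@rank taken at each positive item."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Mean of per-class APs; each entry holds one class's scores or 0/1 labels."""
    aps = [average_precision(s, l) for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)


# Class A: positives ranked 1st and 3rd -> AP = (1/1 + 2/3) / 2.
# Class B: both positives ranked first  -> AP = 1.
ap_a = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
m_ap = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.9, 0.8, 0.2, 0.1]],
    [[1, 0, 1, 0], [1, 1, 0, 0]],
)
```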

**Amounts of images and clusters**

We display $9$ random images for clusters that are pure with respect to a given metadata field. The bottom row depicts clusters that are pure for GPS coordinates but impure for user IDs.

Cluster labels (top row): tag: cat; tag: elephantparadelondon; tag: always; device: CanoScan.

Cluster labels (bottom row): GPS: ($43$, $10$); GPS: ($-34$, $-151$); GPS: ($64$, $-20$); GPS: ($43$, $-104$).
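One way to make "pure for a metadata field" concrete is to measure, for each cluster, the fraction of its images that share the most common value of that field. The sketch below follows this reading; the definition, the threshold, and the names `cluster_purity` and `pure_clusters` are assumptions made for illustration, and the toy metadata values are invented.

```python
from collections import Counter


def cluster_purity(values):
    """Fraction of a cluster's images that share its most common metadata value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def pure_clusters(assignments, metadata, threshold=1.0):
    """Ids of clusters whose purity for `metadata` is at least `threshold`."""
    by_cluster = {}
    for c, v in zip(assignments, metadata):
        by_cluster.setdefault(c, []).append(v)
    return sorted(c for c, vals in by_cluster.items() if cluster_purity(vals) >= threshold)


# Toy example: six images in two clusters, with a GPS cell and a user id each.
assignments = [0, 0, 0, 1, 1, 1]
gps = ["cell_a", "cell_a", "cell_a", "cell_b", "cell_c", "cell_b"]
users = ["u1", "u2", "u3", "u4", "u4", "u4"]

pure_for_gps = pure_clusters(assignments, gps)      # cluster 0 only
pure_for_users = pure_clusters(assignments, users)  # cluster 1 only
# Clusters pure for GPS but impure for user ids, as in the bottom row above.
gps_not_user = [c for c in pure_for_gps if c not in pure_for_users]
```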

**Clustering quality**

**References**