Low-shot learning with large-scale diffusion

Matthijs Douze; Arthur Szlam; Bharath Hariharan; Hervé Jégou

arXiv:1706.02332·cs.CV·June 18, 2018


TL;DR

This paper introduces a large-scale label propagation method for low-shot image classification, leveraging similarity graphs over hundreds of millions of images to achieve state-of-the-art accuracy.

Contribution

It demonstrates that scaling label propagation to massive datasets significantly improves low-shot learning performance.

Findings

01 Scaling label propagation to hundreds of millions of images yields state-of-the-art accuracy.

02 Large-scale similarity graphs effectively support label propagation in low-shot learning.

03 The approach outperforms traditional fine-tuning methods in low-data regimes.

Abstract

This paper considers the problem of inferring image labels when only a few annotated examples are available at training time. This setup is often referred to as low-shot learning, where a standard approach is to re-train the last few layers of a convolutional neural network learned on separate classes for which training examples are abundant. We consider a semi-supervised setting based on a large collection of images to support label propagation. This is made possible by recent advances in large-scale similarity graph construction. We show that, despite its conceptual simplicity, scaling label propagation up to hundreds of millions of images leads to state-of-the-art accuracy in the low-shot learning regime.


Tables

Table 1: Comparison of classifiers for different values of $n$, with $k=30$ for the diffusion results, evaluating only on novel classes using the two different sets of features that we consider.

$n$	out-of-domain diffusion (none)	out-of-domain diffusion (F1M)	out-of-domain diffusion (F10M)	out-of-domain diffusion (F100M)	in-domain (Imagenet)	logistic regression	combined (+F10M)	combined (+F100M)	best reported results [bharath2017low]
$1$	39.4 $\pm$ 0.85	43.9 $\pm$ 0.96	46.3 $\pm$ 1.28	47.6 $\pm$ 1.09	57.7 $\pm$ 1.28	42.6 $\pm$ 1.31	46.5 $\pm$ 1.23	47.9 $\pm$ 1.18	45.1
$2$	47.8 $\pm$ 0.94	52.7 $\pm$ 1.14	55.2 $\pm$ 1.21	57.0 $\pm$ 1.05	66.9 $\pm$ 1.06	54.4 $\pm$ 1.29	57.5 $\pm$ 1.34	58.4 $\pm$ 1.29	58.8
$5$	56.8 $\pm$ 0.73	62.2 $\pm$ 0.44	64.6 $\pm$ 0.57	66.3 $\pm$ 0.68	73.8 $\pm$ 0.29	71.4 $\pm$ 0.54	71.9 $\pm$ 0.55	72.3 $\pm$ 0.58	72.7
$10$	64.9 $\pm$ 0.28	68.0 $\pm$ 0.33	69.9 $\pm$ 0.47	71.7 $\pm$ 0.54	77.6 $\pm$ 0.23	78.6 $\pm$ 0.27	78.7 $\pm$ 0.21	79.2 $\pm$ 0.14	79.1
$20$	71.4 $\pm$ 0.26	72.7 $\pm$ 0.40	74.1 $\pm$ 0.38	75.3 $\pm$ 0.29	80.0 $\pm$ 0.21	82.9 $\pm$ 0.20	83.0 $\pm$ 0.15	83.2 $\pm$ 0.22	82.6

Equations

$$a_{n}\log(l_{ic}^{\mathrm{logreg}})+(1-a_{n})\log(l_{ic}^{\mathrm{dif}}),$$

$$\mathbf{W}_{0}=\left[\begin{array}{cc}\mathbf{W}_{\mathrm{LL}}&\mathbf{W}_{\mathrm{LB}}\\ \mathbf{W}_{\mathrm{BL}}&\mathbf{W}_{\mathrm{BB}}\end{array}\right]\in\{0,1\}^{(n_{\mathrm{L}}+n_{\mathrm{B}})\times(n_{\mathrm{L}}+n_{\mathrm{B}})},$$


Code & Models

Repositories

facebookresearch/low-shot-with-diffusion (PyTorch, official)


Full text


Supplementary material for:

Low-shot learning with large-scale diffusion

Matthijs Douze $\dagger$ , Arthur Szlam $\dagger$ , Bharath Hariharan $\dagger$ , Hervé Jégou $\dagger$

$\dagger$ Facebook AI Research

* Cornell University. This work was carried out while B. Hariharan was a post-doc at FAIR.


We present several additional results and details to complement the paper. Section 1 reports another evaluation protocol, which restricts the evaluation to novel classes. Sections 2 and 3 are parametric evaluations. Section 4 gives some details about the graph computation.

1 Evaluation results on novel classes

In the main paper, we evaluated the search performance on all the test images from group 2. The performance restricted to only the novel classes is also reported in prior work [bharath2017low] using a combination of classifiers. Table 1 shows the results in this setting.

As expected, the results reported in this table are lower than those obtained in the setup where all test images are classified, because the novel classes are harder to classify than the base classes. Otherwise, the ordering of the methods is preserved and the conclusions are identical. The diffusion is effective in the low-shot regime and is, by itself, better than the state of the art by a large margin when only one example is available. The combination with late fusion significantly outperforms the state of the art, even in the out-of-domain setup.

2 Details of the parametric evaluation

In the paper we reported results for the edge weighting and graph normalization with the best parameter setting. Here, we report results for all parameters (note that our parametric experiments use the set of baseline image descriptors used in the arXiv version of the paper by Hariharan et al. [bharath2017low], and the figure compares all methods using those underlying features; therefore the results are not directly comparable with the rest of the paper). We evaluate the following edge weightings (Figure 2, first row):

• Gaussian weighting. The edge weight is $e^{-x^{2}/\sigma^{2}}$, with $x$ the distance between the edge nodes. Note that $\sigma\rightarrow\infty$ corresponds to a constant weighting;

• Weighting based on the “meaningful neighbors” proposal [ODL07]. It relies on an exponential fit of neighbor distances. For a given graph node and the $i$-th neighbor in its result list, the weight is $s(1-e^{-\lambda s})^{k}$, where $s$ is the distance, remapped linearly to $[0,1]$ so that the first neighbor has $s=1$ and the $k$-th neighbor has $s=0$. We vary the parameter $\lambda$ in the plot.
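As an illustration of the Gaussian weighting above, here is a minimal sketch in Python with NumPy. The function name `gaussian_edge_weights` and the sample distances are hypothetical and not taken from the paper's code; the sketch only evaluates the stated formula and treats an infinite $\sigma$ as the constant-weight limit.

```python
import numpy as np

def gaussian_edge_weights(distances, sigma):
    """Return exp(-x^2 / sigma^2) for each neighbor distance x."""
    distances = np.asarray(distances, dtype=np.float64)
    if np.isinf(sigma):
        # sigma -> infinity: every edge gets the same (constant) weight.
        return np.ones_like(distances)
    return np.exp(-(distances ** 2) / sigma ** 2)

# Hypothetical distances from one node to four of its neighbors.
dists = np.array([0.0, 0.5, 1.0, 2.0])
weights = gaussian_edge_weights(dists, sigma=1.0)
constant = gaussian_edge_weights(dists, sigma=np.inf)
```

Small values of `sigma` concentrate the weight on the closest neighbors, while large values approach the constant weighting that the experiments found sufficient.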

We also report results for different normalizations of the matrix $\mathbf{L}$ . In Figure 2 (second row), we compare:

• The non-linear $\Gamma_{r}$ normalization: all elements of $\mathbf{L}$ are raised to a power $r$. We vary the parameter $r$; $r=1$ corresponds to the identity transform;

• A class-frequency normalization: we classify all images in the graph with a logistic regression classifier. We use the predicted frequency of each class over the whole graph, and raise it to some power (the parameter) to reduce or increase its peakiness. This choice is inspired by the Markov Clustering Algorithm [EDO2002]. This gives a normalization factor that we enforce for each column of $\mathbf{L}$, instead of the default uniform distribution.
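The two normalizations above can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the helper names `gamma_r` and `rescale_columns` are hypothetical, the toy matrix is made up, and interpreting "enforcing" a per-column factor as a plain column rescaling is an assumption.

```python
import numpy as np

def gamma_r(L, r):
    """Raise every element of L to the power r (r = 1 is the identity)."""
    return np.power(np.asarray(L, dtype=np.float64), r)

def rescale_columns(L, target):
    """Rescale each column of L so that its sum matches `target`.

    Assumption: the per-column normalization factor is applied as a
    simple column rescaling; the original procedure may differ.
    """
    L = np.asarray(L, dtype=np.float64)
    col_sums = L.sum(axis=0, keepdims=True)
    # Avoid dividing by zero for classes that received no mass.
    safe = np.where(col_sums > 0, col_sums, 1.0)
    return L * (np.asarray(target, dtype=np.float64) / safe)

# Hypothetical 3-image, 2-class score matrix.
L = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
sharpened = gamma_r(L, r=2.0)
# Hypothetical predicted class frequencies, raised to a power of 0.5.
freq = np.array([0.7, 0.3]) ** 0.5
rescaled = rescale_columns(L, target=freq / freq.sum())
```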

The conclusion of these experiments is that these variants do not improve over constant weights and a standard diffusion, most of them having a neutral effect. Therefore, we conclude that the diffusion process mostly depends on the topology of the graph.

3 Late fusion weights

Let $l_{i}^{\mathrm{logreg}}$ and $l_{i}^{\mathrm{dif}}\in[0,1]^{C}$ denote the distributions over classes returned by the two classifiers for image $i$, with $l_{ic}$ the entry for class $c$. We fuse the log-likelihoods by a weighted average, which amounts to taking as the top-5 predictions the classes $c$ that maximize

$$a_{n}\log(l_{ic}^{\mathrm{logreg}})+(1-a_{n})\log(l_{ic}^{\mathrm{dif}}),$$

where $a_{n}$ is the optimal mixing coefficient for $n$ seed points, as found by cross-validation.

Figure 3 shows these optimal mixing factors. Since the logistic regression is better at classifying with many training examples, the parameter $a_{n}$ increases with $n$ .
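The late fusion rule of this section can be sketched in a few lines of NumPy. The function name `late_fusion_top5`, the small `eps` added before the logarithm (to avoid `log(0)`), and the toy distributions are assumptions made for illustration; in the paper, the value of $a_{n}$ comes from cross-validation.

```python
import numpy as np

def late_fusion_top5(l_logreg, l_dif, a_n, eps=1e-12):
    """Fuse two class distributions in the log domain and rank the classes.

    eps is an assumption of this sketch: it guards against log(0).
    """
    l_logreg = np.asarray(l_logreg, dtype=np.float64)
    l_dif = np.asarray(l_dif, dtype=np.float64)
    fused = a_n * np.log(l_logreg + eps) + (1.0 - a_n) * np.log(l_dif + eps)
    # Indices of the 5 classes with the largest fused score, best first.
    top5 = np.argsort(-fused)[:5]
    return top5, fused

# Hypothetical distributions over C = 6 classes for a single image.
l_logreg = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
l_dif = np.array([0.04, 0.06, 0.10, 0.15, 0.25, 0.40])

only_logreg, _ = late_fusion_top5(l_logreg, l_dif, a_n=1.0)
only_dif, _ = late_fusion_top5(l_logreg, l_dif, a_n=0.0)
mixed, _ = late_fusion_top5(l_logreg, l_dif, a_n=0.7)
```

Setting `a_n` to 1 or 0 recovers the ranking of logistic regression or diffusion alone, which matches the role of $a_{n}$ as a mixing coefficient.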

4 Computation of the $\mathbf{W}_{0}$ blocks

As stated in the paper, we need to compute the 4 blocks of the matrix $\mathbf{W}_{0}$ :

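$$\mathbf{W}_{0}=\left[\begin{array}{cc}\mathbf{W}_{\mathrm{LL}}&\mathbf{W}_{\mathrm{LB}}\\ \mathbf{W}_{\mathrm{BL}}&\mathbf{W}_{\mathrm{BB}}\end{array}\right]\in\{0,1\}^{(n_{\mathrm{L}}+n_{\mathrm{B}})\times(n_{\mathrm{L}}+n_{\mathrm{B}})},$$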

where, usually, $n_{\mathrm{L}}\ll n_{\mathrm{B}}$. Each block requires a $k$-nearest-neighbor search. We employ the Faiss library (http://github.com/facebookresearch/faiss), which is optimized for this task [johnson2017billion], and use it as follows:

• $\mathbf{W}_{\mathrm{BB}}\in\{0,1\}^{n_{\mathrm{B}}\times n_{\mathrm{B}}}$: we use a Faiss index referred to as “IVFFlat”. The accuracy-speed compromise is controlled by a parameter giving the number of inverted lists visited at search time. We adopted a relatively high probe setting (256) to guarantee that most of the actual neighbors are retrieved. With the recommended settings of Faiss, the complexity of one search is proportional to $d\sqrt{n_{\mathrm{B}}}$, so the total complexity is $\mathcal{O}(dn_{\mathrm{B}}^{1.5})$. This is super-linear with respect to $n_{\mathrm{B}}$, but it is still relatively efficient (see our timings) and is performed off-line;

• $\mathbf{W}_{\mathrm{LB}}\in\{0,1\}^{n_{\mathrm{L}}\times n_{\mathrm{B}}}$: we re-use the same index to do $n_{\mathrm{L}}$ similarity search operations, this time at a cost of only $\mathcal{O}(dn_{\mathrm{L}}\sqrt{n_{\mathrm{B}}})$;

• $\mathbf{W}_{\mathrm{BL}}\in\{0,1\}^{n_{\mathrm{B}}\times n_{\mathrm{L}}}$: here the search is over the seed image descriptors. We found that, in practice, constructing an index on these images is at best 1.4 $\times$ faster than brute-force search. Therefore, we use brute-force search in this case, which is of order $\mathcal{O}(dn_{\mathrm{B}}n_{\mathrm{L}})$;

• $\mathbf{W}_{\mathrm{LL}}\in\{0,1\}^{n_{\mathrm{L}}\times n_{\mathrm{L}}}$: it has a negligible complexity.
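The brute-force search used for the $\mathbf{W}_{\mathrm{BL}}$ block can be illustrated with plain NumPy. This is a sketch, not the Faiss-based code of the paper: the function name `brute_force_knn`, the squared L2 metric, and the random toy descriptors are assumptions. Each of the `n_B` background descriptors is compared with all `n_L` seed descriptors, which is where the $\mathcal{O}(dn_{\mathrm{B}}n_{\mathrm{L}})$ cost comes from.

```python
import numpy as np

def brute_force_knn(queries, database, k):
    """Exact k-NN under squared L2 distance.

    Cost is O(d * n_queries * n_database): every pair is compared.
    """
    q2 = (queries ** 2).sum(axis=1)[:, None]
    b2 = (database ** 2).sum(axis=1)[None, :]
    d2 = q2 + b2 - 2.0 * queries @ database.T
    # Select the k smallest distances per row, then sort those k.
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    idx = np.take_along_axis(idx, order, axis=1)
    return np.take_along_axis(d2, idx, axis=1), idx

# Hypothetical toy sizes: n_B = 1000 background and n_L = 20 seed descriptors.
rng = np.random.default_rng(0)
background = rng.standard_normal((1000, 16))
seeds = rng.standard_normal((20, 16))

# For every background image, find its 5 nearest seed images.
dist, nn = brute_force_knn(background, seeds, k=5)
```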

The fusion of the result lists $[\mathbf{W}_{\mathrm{LL}}\;\mathbf{W}_{\mathrm{LB}}]$ and $[\mathbf{W}_{\mathrm{BL}}\;\mathbf{W}_{\mathrm{BB}}]$ to get $k$ results per row of $\mathbf{W}_{0}$ is done in a single pass and in a negligible amount of time. Therefore the dominant complexity is $\mathcal{O}(dn_{\mathrm{B}}n_{\mathrm{L}})$. A typical breakdown of the timings for F100M is (in seconds):
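To make the fusion of result lists concrete, here is a simple sketch. The paper describes a single-pass merge; the version below swaps in a sort over the concatenated candidates for brevity. The function name `merge_knn_lists`, the `offset_b` convention (seed nodes numbered first, background nodes shifted by $n_{\mathrm{L}}$), and the toy values are assumptions.

```python
import numpy as np

def merge_knn_lists(dist_a, idx_a, dist_b, idx_b, offset_b, k):
    """Merge two per-row neighbor lists and keep the k closest overall.

    Indices from the second list are shifted by offset_b so that both
    lists refer to the joint node numbering of W_0.
    """
    dist = np.concatenate([dist_a, dist_b], axis=1)
    idx = np.concatenate([idx_a, idx_b + offset_b], axis=1)
    # Sort-based merge (not the single-pass merge mentioned in the text).
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return (np.take_along_axis(dist, order, axis=1),
            np.take_along_axis(idx, order, axis=1))

# One query row: neighbors among n_L = 3 seed nodes and among background nodes.
dist_seed = np.array([[0.1, 0.4]])
idx_seed = np.array([[0, 2]])
dist_bg = np.array([[0.2, 0.3]])
idx_bg = np.array([[5, 7]])

merged_dist, merged_idx = merge_knn_lists(
    dist_seed, idx_seed, dist_bg, idx_bg, offset_b=3, k=3)

# Binary row of W_0 over n_L + n_B = 3 + 10 nodes (hypothetical n_B = 10).
w0_row = np.zeros(13, dtype=np.uint8)
w0_row[merged_idx[0]] = 1
```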

[TABLE]

For $\mathbf{W}_{\mathrm{LB}}$ we decompose the timing into: loading of the precomputed IVFFlat index (and moving it to GPU if appropriate) and the actual computation of the neighbors.
