Learning by Active Nonlinear Diffusion

Mauro Maggioni; James M. Murphy

arXiv:1905.12989·cs.LG·May 31, 2019


TL;DR

This paper introduces an active learning approach leveraging diffusion processes on graphs to effectively label high-dimensional data with complex geometries, providing theoretical guarantees and demonstrating strong empirical results.

Contribution

It presents a novel active learning method based on diffusion distances that handles nonlinear geometries and noise, with proven theoretical performance and efficient implementation.

Findings

01

The method achieves high-accuracy labeling with few labels.

02

It has theoretical guarantees under complex data models.

03

Demonstrates competitive results on real hyperspectral images.

Abstract

This article proposes an active learning method for high dimensional data, based on intrinsic data geometries learned through diffusion processes on graphs. Diffusion distances are used to parametrize low-dimensional structures on the dataset, which allow for high-accuracy labelings of the dataset with only a small number of carefully chosen labels. The geometric structure of the data suggests regions that have homogeneous labels, as well as regions with high label complexity that should be queried for labels. The proposed method enjoys theoretical performance guarantees on a general geometric data model, in which clusters corresponding to semantically meaningful classes are permitted to have nonlinear geometries, high ambient dimensionality, and suffer from significant noise and outlier corruption. The proposed algorithm is implemented in a manner that is quasilinear in the number of…

Equations

$$D_{t}(x_{i},x_{j})=\|p_{t}(x_{i},\cdot)-p_{t}(x_{j},\cdot)\|_{\ell^{2}(w)}=\sqrt{\sum_{\ell=1}^{n}\left(p_{t}(x_{i},x_{\ell})-p_{t}(x_{j},x_{\ell})\right)^{2}w(x_{\ell})}.$$

$$(P^{t})_{ij}=\sum_{\ell=1}^{n}\lambda_{\ell}^{t}\psi_{\ell}(x_{i})\varphi_{\ell}(x_{j}),$$

$$D_{t}(x_{i},x_{j})=\|p_{t}(x_{i},\cdot)-p_{t}(x_{j},\cdot)\|_{\ell^{2}(1/\pi)}=\sqrt{\sum_{\ell=1}^{n}\lambda_{\ell}^{2t}\left(\psi_{\ell}(x_{i})-\psi_{\ell}(x_{j})\right)^{2}}.$$

$$D_{t}(x_{i},x_{j})\approx\sqrt{\sum_{\ell=1}^{M}\lambda_{\ell}^{2t}\left(\psi_{\ell}(x_{i})-\psi_{\ell}(x_{j})\right)^{2}}.$$

$$x_{i}\mapsto\left(\lambda_{1}^{t}\psi_{1}(x_{i}),\lambda_{2}^{t}\psi_{2}(x_{i}),\dots,\lambda_{M}^{t}\psi_{M}(x_{i})\right)$$

$$p(x)=\sum_{y\in\text{NN}_{k}(x)}\exp\left(-\|x-y\|_{2}^{2}/\sigma_{0}^{2}\right),$$

$$\rho_{t}(x)=\begin{cases}\min\{D_{t}(x,y)\mid p(y)\geq p(x),\ x\neq y\},&x\neq\arg\max_{z}p(z),\\\max_{y\in X}D_{t}(x,y),&x=\arg\max_{z}p(z)\end{cases}$$

$$\mathcal{D}_{t}(x)=p(x)\rho_{t}(x).$$

$$D_{t}^{\text{in}},\qquad D_{t}^{\text{btw}}$$

$$\mathcal{M}=\{x\in X\mid\exists k\text{ such that }x=\arg\max_{y\in X_{k}}p(y)\},$$

$$D_{t}^{\text{in}}/D_{t}^{\text{btw}}<\max(\mathcal{M})/\min(\mathcal{M}),$$

$$\mathcal{P}(\mathcal{C})=\frac{1}{n}\sum_{k=1}^{K}\left|\{x_{i}\in C_{k}\mid y_{i}=\bar{y}_{k}\}\right|.$$


Full text


Abstract.

This article proposes an active learning method for high dimensional data, based on intrinsic data geometries learned through diffusion processes on graphs. Diffusion distances are used to parametrize low-dimensional structures on the dataset, which allow for high-accuracy labelings of the dataset with only a small number of carefully chosen labels. The geometric structure of the data suggests regions that have homogeneous labels, as well as regions with high label complexity that should be queried for labels. The proposed method enjoys theoretical performance guarantees on a general geometric data model, in which clusters corresponding to semantically meaningful classes are permitted to have nonlinear geometries, high ambient dimensionality, and suffer from significant noise and outlier corruption. The proposed algorithm is implemented in a manner that is quasilinear in the number of unlabeled data points, and exhibits competitive empirical performance on synthetic datasets and real hyperspectral remote sensing images.

Key words and phrases:

Active learning, statistical learning, diffusion geometry, machine learning, spectral graph theory

1991 Mathematics Subject Classification:

Primary: 58F15, 58F17; Secondary: 53C35.

This research is supported by NSF-DMS-125012, NSF-DMS-1724979, NSF-DMS-1708602, NSF-ATD-1737984, AFOSR FA9550-17-1-0280, NSF-IIS-1546392.

∗ Corresponding author: James M. Murphy

Mauro Maggioni

Department of Mathematics, Department of Applied Mathematics and Statistics,

Mathematical Institute of Data Sciences, Institute of Data Intensive Engineering and Science

Johns Hopkins University

Baltimore, MD 21218, USA

James M. Murphy∗

Department of Mathematics

Tufts University

Medford, MA 02155, USA

1. Introduction

Statistical and machine learning techniques are revolutionizing the sciences. Advances in medical diagnosis [1], automatic game playing [2], and computer vision [3] have been sparked by seismic advances in computational power and innovative learning algorithms and architectures. However, many state-of-the-art machine learning approaches are predicated on the availability of huge labeled data sets that may be used to train the parameters of the underlying algorithms. Unfortunately, many important scientific problems do not have large, accurately labeled training sets readily available. This limits the practicality of many state-of-the-art supervised methods. Moreover, several fields—medicine and remote sensing, for example—are not amenable to easily generating new labeled data points at scale, due to the high cost of labeling data points. This renders the applicability of many state-of-the-art supervised learning algorithms—including modern deep learning methods which may depend on millions of parameters—problematic, as generating sufficient training data may be resource-intensive.

When training datasets do not exist or are burdensome to generate, alternative methods may be used to exploit the glut of unlabeled data. Data augmentation [4, 5] may be used to generate new labeled training points by, for example, perturbing existing training points in a suitable manner. Unsupervised methods, which use no labeled training data at all, are ideal when insufficient training data is available, as they work entirely on the unlabeled data. However, unsupervised methods may be inadequate for highly complex data. Indeed, such approaches enjoy performance guarantees only when rigid geometrical or statistical assumptions are made on the data [6, 7, 8, 9, 10, 11]. Semi-supervised methods [12] provide a middle ground between the supervised (abundant labeled data for training) and unsupervised (no labeled data for training) regimes, taking advantage of large quantities of unlabeled data while still allowing labeled points to influence classification. When the unsupervised structure of the data (e.g. its geometric or statistical properties) is compatible with the labels of the data, semi-supervised learning may improve over unsupervised learning and also over classical supervised learning with the same fixed labeled training data.

This article proposes an active learning scheme for high-dimensional datasets exhibiting intrinsically low-dimensional structure. Active learning is a form of semi-supervised learning in which an algorithm uses the unlabeled data to determine which data points to query for labels. In the proposed method, the geometry of the data is parametrized through diffusion processes defined on a data-dependent graph [13, 14], which are robust to high ambient dimensionality, noise, and non-spherical cluster shapes. The inferred geometry—which is computed without supervision—is then analyzed to determine which data points should be queried for labels; the query points are chosen to have maximum impact, so that relatively few are needed to achieve good empirical performance. The proposed active learning scheme is called learning by active nonlinear diffusion (LAND).

1.1. Major Contributions and Article Outline

The major contributions of this article are twofold. First, LAND is proposed and is proven to perform well for data generated according to a flexible geometric data model. With only a small number of queries, LAND achieves perfect accuracy even for data that is high-dimensional, contains classes that are highly nonlinear or non-compact, and is corrupted by significant noise and outliers. The theoretical results are derived from an analysis of the underlying diffusion distances, which in turn are amenable to analysis using techniques from spectral graph theory and the analysis of Markov chains.

Second, the proposed method is implemented numerically. Taking advantage of fast nearest neighbor search algorithms and eigensolvers for sparse matrices, the proposed method is proven to enjoy quasilinear complexity in the number of sample points under the proposed data model, which supposes that the underlying data has intrinsically small dimensionality (in the sense of lying close to a low-dimensional manifold, for example). LAND is demonstrated on synthetic datasets as well as real hyperspectral images, demonstrating its suitability for high-dimensional geometric data.

The remainder of the article is organized as follows. Background on active learning and diffusion geometry is presented in Section 2. The geometric data model and algorithm are proposed and analyzed in Section 3. Comparisons with related works are also presented in Section 3. Numerical experiments are in Section 4. Conclusions and future research directions are in Section 5.

2. Background

The proposed active learning algorithm exploits the underlying diffusion geometry of data to efficiently determine points to query for labels. In this section, we review active learning as well as diffusion geometry.

2.1. Background on Active Learning

Active learning is a type of semisupervised learning in which unlabeled data is analyzed to determine which points to query for labels [15]. It differs from traditional semisupervised learning in that the labeling algorithm is permitted to ask for the labels of certain points, instead of being provided with a random sample of labeled points. Under certain data models and methods for parsimoniously selecting query points, the active learning approach can perform as well as traditional semi-supervised or even supervised learning, with far fewer labels [16, 17]. The crucial theoretical question is how to determine which data points should be queried for labels. The active learning framework assumes there is an underlying budget that can be spent to label points. This budget should be spent carefully, in order to only query points that are most likely to prove significant for the overall labeling of the data.

Approaches to active learning may be categorized into two general strategies: hypothesis space reduction and cluster exploitation [18]. The first category conceives of supervised learning as a process of using training points to select a “good” classifier from a large space of possible classifiers. Asymptotically, as the number of labeled sample points $n_{\ell}\rightarrow\infty$ , a consistent supervised learning procedure converges to an optimal classifier. In practice, the rate of convergence in $n_{\ell}$ is relevant—the faster the rate of convergence, the better the learning algorithm. In this framework, active learning is a family of methods for selecting query points such that the convergence rate towards a good classifier is fast in $n_{\ell}$ , in particular faster than passive sampling methods, for example sampling labels uniformly at random. That is, query points should be influential in distinguishing between different possible classifiers, and should allow for convergence towards the “optimal” classifier with fewer points than if the labeled points were selected uniformly at random. These active learning approaches can, in certain cases, significantly improve the expected error rate of the classifier as a function of $n_{\ell}$ [19, 17, 20, 21, 22].

A second category of active learning approaches seeks to exploit cluster structure in the data in order to emphasize sampling near complex regions of the data with heterogeneous labels, and to avoid oversampling near simple, homogeneous regions of the data. Indeed, if a cluster (detected through a prescribed clustering algorithm) can be estimated as relatively pure with respect to its labels, then it may be efficient to simply give all points in the cluster the same label and to focus the limited querying resources in more ambiguous regions. A crucial problem in this framework is to spend the budget in a way that balances two different tasks: confirming the label homogeneity of particular data regions and exploring new data regions. Methods based on iteratively pruning hierarchical clustering trees have been proposed [23] and analyzed in terms of label smoothness with respect to the scales of the tree [24].

The method proposed in this paper is related to the second category, and exploits the underlying geometry of the data sample in order to estimate the most impactful points to query for labels. In order to develop notions of cluster geometry that are robust to being embedded in a high dimensional space, to being non-spherical in shape, and to corruption by noise and outlier points, the diffusion geometry of the underlying data is estimated and used as the basis for all subsequent pairwise comparisons. This provides a set of (essentially) geometrically intrinsic coordinates for the data that are robust to dimensionality, nonlinearity, and noise.

An example of synthetic toy data for which diffusion geometry notably decreases the number of active learning queries necessary for good accuracy appears in Figure 1. The role of diffusion geometry is crucial to the proposed method, and it is reviewed in detail in Section 2.2.

2.2. Background on Diffusion Geometry

Let $X=\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$ be discrete data. The diffusion geometry of $X$ is learned through Markov diffusion processes defined on a graph with nodes corresponding to the points $\{x_{i}\}_{i=1}^{n}$ and transition probabilities proportional to the similarities of these points in some metric [13, 14]. That is, points that are nearby have high probabilities of pairwise transition, and points that are far apart have low probabilities of transition. By analyzing the diffusion process across time scales, natural geometric structure in the data can be inferred.

More precisely, let $\mathcal{G}=(X,W)$ be a weighted, undirected graph with nodes $X$ and weight $W_{ij}\in[0,1]$ between $x_{i},x_{j}\in X$ . Typically $W_{ij}=\mathcal{K}(x_{i},x_{j})$ for some symmetric, radial kernel $\mathcal{K}:\mathbb{R}^{D}\times\mathbb{R}^{D}\rightarrow[0,1]$ . The weight matrix $W$ is normalized to produce a Markov transition matrix $P=D^{-1}W$ , where $D$ is the diagonal degree matrix with $D_{ii}=\sum_{j=1}^{n}W_{ij}$ . The matrix $P$ is row-stochastic, and diffusion distances measure how similar points are according to their transition probabilities in $P$ .
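The construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian kernel, the bandwidth name `sigma`, and the helper name `transition_matrix` are choices made here, not details fixed by the text, and the dense $n\times n$ computation ignores the sparsity used later for efficiency.

```python
import numpy as np

def transition_matrix(X, sigma=1.0):
    """Return (W, deg, P) with P = D^{-1} W built from a Gaussian kernel.

    The kernel and the bandwidth `sigma` are illustrative assumptions.
    """
    # Pairwise squared Euclidean distances between the rows of X.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma**2)   # symmetric weights in [0, 1]
    deg = W.sum(axis=1)          # diagonal entries D_ii of the degree matrix
    P = W / deg[:, None]         # divide row i by D_ii, so each row sums to 1
    return W, deg, P

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
W, deg, P = transition_matrix(X)
print(np.allclose(P.sum(axis=1), 1.0))  # checks that P is row-stochastic
```

Dividing each row by its degree is what makes $P$ row-stochastic while generally breaking the symmetry of $W$.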

Definition 2.1.

Let $P$ be a Markov transition matrix defined on $X=\{x_{i}\}_{i=1}^{n}$ . Let $p_{t}(x_{i},x_{j})=(P^{t})_{ij}$ . The diffusion distance between $x_{i}$ and $x_{j}$ at time $t$ with respect to weight $w:X\rightarrow[0,\infty)$ is

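$$D_{t}(x_{i},x_{j})=\|p_{t}(x_{i},\cdot)-p_{t}(x_{j},\cdot)\|_{\ell^{2}(w)}=\sqrt{\sum_{\ell=1}^{n}\left(p_{t}(x_{i},x_{\ell})-p_{t}(x_{j},x_{\ell})\right)^{2}w(x_{\ell})}.$$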

The time parameter $t$ is a global time scale at which the diffusion process runs. For small $t$ , the process has run for a short amount of time, which may prevent important, large scale geometric structures in the data from impacting the diffusion distances. On the other extreme, the diffusion distances all collapse to 0 as $t\rightarrow\infty$ , under the assumption that $P$ is ergodic, since $P^{t}$ converges to the rank 1 matrix with the stationary distribution $\pi$ as rows, where $\pi P=\pi$ . When the data has underlying geometric structure, $t$ parametrizes multiscale hierarchy, with small $t$ realizing fine-scale structures and $t$ large realizing coarse-scale structures [25, 26, 27].
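To make the role of $t$ concrete, the following sketch evaluates Definition 2.1 directly from $P^{t}$ and compares a short and a long time scale. The helper name `diffusion_distance`, the Gaussian kernel with unit bandwidth, the random test data, and the specific times 1 and 50 are all assumptions made for illustration; the weight is taken to be $w=1/\pi$.

```python
import numpy as np

def diffusion_distance(P, t, w):
    """All pairwise diffusion distances D_t, computed directly from P^t."""
    Pt = np.linalg.matrix_power(P, t)
    # diff[i, j, l] = p_t(x_i, x_l) - p_t(x_j, x_l)
    diff = Pt[:, None, :] - Pt[None, :, :]
    # Weighted l^2 norm over the last index; w broadcasts along that axis.
    return np.sqrt(np.sum(diff**2 * w, axis=-1))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq)              # Gaussian kernel, unit bandwidth (assumed)
deg = W.sum(axis=1)
P = W / deg[:, None]         # Markov transition matrix P = D^{-1} W
pi = deg / deg.sum()         # stationary distribution: pi P = pi

D_short = diffusion_distance(P, 1, 1.0 / pi)
D_long = diffusion_distance(P, 50, 1.0 / pi)
print(D_short.max(), D_long.max())  # the largest distance shrinks as t grows
```

This direct evaluation forms an $n\times n\times n$ array, so it is suitable only for small examples.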

While $P$ is not symmetric, it is diagonally conjugate to a symmetric matrix: $D^{\frac{1}{2}}PD^{-\frac{1}{2}}=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ . Hence, $P$ admits a spectral decomposition, which can be exploited for computations of diffusion distance. More precisely, let $\{(\lambda_{\ell},\phi_{\ell})\}_{\ell=1}^{n}$ be the eigenvalues and eigenvectors of $D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ , sorted so that $1=\lambda_{1}>|\lambda_{2}|\geq\dots\geq|\lambda_{n}|$ . Then

$$(P^{t})_{ij}=\sum_{\ell=1}^{n}\lambda_{\ell}^{t}\psi_{\ell}(x_{i})\varphi_{\ell}(x_{j}),$$

where $\psi_{\ell}(x_{i})=\phi_{\ell}(x_{i})/\sqrt{\pi(x_{i})},\varphi_{\ell}(x_{j})=\phi_{\ell}(x_{j})\sqrt{\pi(x_{j})}$ . If $\{\psi_{\ell}\}_{\ell=1}^{n},\{\varphi_{\ell}\}_{\ell=1}^{n}$ are understood as column vectors, this is equivalent to the decomposition $P^{t}=\sum_{\ell=1}^{n}\lambda_{\ell}^{t}\psi_{\ell}\varphi_{\ell}^{\top}$ . In particular, $\{\varphi_{\ell}\}_{\ell=1}^{n}$ is an orthonormal basis for $l^{2}(1/\pi)$ , so that diffusion distances with respect to the weight $w(x_{i})=1/\pi(x_{i})$ may be written in terms of $\{\phi_{\ell}\}_{\ell=1}^{n}$ :

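$$D_{t}(x_{i},x_{j})=\|p_{t}(x_{i},\cdot)-p_{t}(x_{j},\cdot)\|_{\ell^{2}(1/\pi)}=\sqrt{\sum_{\ell=1}^{n}\lambda_{\ell}^{2t}\left(\psi_{\ell}(x_{i})-\psi_{\ell}(x_{j})\right)^{2}}.$$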

If the underlying transition matrix $P$ is approximately low rank, the modulus of the eigenvalues $\{\lambda_{\ell}\}_{\ell=1}^{n}$ decays rapidly, so that for $t$ sufficiently large, this sum may be truncated after $M=O(1)$ eigenpairs yielding the approximate diffusion distances

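$$D_{t}(x_{i},x_{j})\approx\sqrt{\sum_{\ell=1}^{M}\lambda_{\ell}^{2t}\left(\psi_{\ell}(x_{i})-\psi_{\ell}(x_{j})\right)^{2}}.$$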

This truncation has the added benefit of denoising the diffusion distances, since the eigenvectors associated with eigenvalues away from 1 in modulus (in some sense the high frequency eigenvectors) correspond not to intrinsic geometric structures in the data, but to random fluctuations produced by sampling [28]. The embedding

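$$x_{i}\mapsto\left(\lambda_{1}^{t}\psi_{1}(x_{i}),\lambda_{2}^{t}\psi_{2}(x_{i}),\dots,\lambda_{M}^{t}\psi_{M}(x_{i})\right)$$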

may be understood as a form of nonlinear dimension reduction, and also as a set of (essentially) geometrically intrinsic coordinates for the data [29].
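The spectral computation in this section can be checked numerically. The sketch below builds the embedding from the eigendecomposition of $D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$ and compares Euclidean distances in the embedding with diffusion distances evaluated from $P^{t}$ with weight $1/\pi$. The function name `diffusion_map`, the dense eigensolver, and the toy data are assumptions for illustration; a sparse eigensolver would be used at scale.

```python
import numpy as np

def diffusion_map(W, t, M):
    """Embed x_i as (lam_1^t psi_1(x_i), ..., lam_M^t psi_M(x_i)); t is a positive integer."""
    deg = W.sum(axis=1)
    pi = deg / deg.sum()                   # stationary distribution of P = D^{-1} W
    S = W / np.sqrt(np.outer(deg, deg))    # D^{-1/2} W D^{-1/2}, a symmetric matrix
    lam, phi = np.linalg.eigh(S)           # real eigenvalues, orthonormal eigenvectors
    order = np.argsort(-np.abs(lam))       # sort by decreasing modulus
    lam, phi = lam[order], phi[:, order]
    psi = phi / np.sqrt(pi)[:, None]       # psi_l(x_i) = phi_l(x_i) / sqrt(pi(x_i))
    return (lam[:M] ** t) * psi[:, :M]     # row i holds the coordinates of x_i

def pairwise(E):
    """Euclidean distances between the rows of E."""
    return np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 2))
W = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
t = 3

D_full = pairwise(diffusion_map(W, t, M=len(W)))   # all n eigenpairs
D_trunc = pairwise(diffusion_map(W, t, M=3))       # truncated after M = 3

# Reference values straight from the definition with weight w = 1 / pi.
deg = W.sum(axis=1)
pi = deg / deg.sum()
Pt = np.linalg.matrix_power(W / deg[:, None], t)
D_def = np.sqrt(np.sum((Pt[:, None, :] - Pt[None, :, :]) ** 2 / pi, axis=-1))

print(np.abs(D_full - D_def).max())   # discrepancy between the two computations
print(np.abs(D_trunc - D_def).max())  # error introduced by truncation
```

Because truncation only drops nonnegative terms from the sum, the truncated distances never exceed the full ones.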

3. Proposed Algorithm and Analysis

Let $\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$ . The LAND algorithm requires determining which points should be queried for labels. This is done by estimating modes of the nonlinear clusters in the data through a combination of density estimation and the diffusion geometry of the data.

Let $p:\mathbb{R}^{D}\rightarrow[0,\infty)$ be a kernel density estimator, for example

$$p(x)=\sum_{y\in\text{NN}_{k}(x)}\exp\left(-\|x-y\|_{2}^{2}/\sigma_{0}^{2}\right),$$

where $\text{NN}_{k}(x)$ are the $k$ -nearest neighbors of $x$ in Euclidean distance, and $\sigma_{0}$ is a scaling parameter. Let $D_{t}$ be the diffusion distance metric on $X$ , and let

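$$\rho_{t}(x)=\begin{cases}\min\{D_{t}(x,y)\mid p(y)\geq p(x),\ x\neq y\},&x\neq\arg\max_{z}p(z),\\\max_{y\in X}D_{t}(x,y),&x=\arg\max_{z}p(z)\end{cases}$$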

be the ( $t$ -dependent) diffusion distance between a point and its nearest diffusion neighbor of higher density if $x$ is not the maximizer of $p(x)$ , and the maximum diffusion distance to another point if $x$ is the maximizer of $p(x)$ . The modes of the data are determined through the quantity

$$\mathcal{D}_{t}(x)=p(x)\rho_{t}(x).$$

Points will have a large $\mathcal{D}_{t}$ value if they are high density and are $D_{t}$ -far from other high density points. Following [27], we characterize the modes of $X$ as the maximizers of $\mathcal{D}_{t}$ . This notion is robust to data geometry—as captured by diffusion distances—and provides a multiscale hierarchy to the structure of the data. See Figure 2 for an illustration of how $\mathcal{D}_{t}$ changes with time.
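A small sketch of the three quantities $p$, $\rho_{t}$, and $\mathcal{D}_{t}$ follows. The function names are hypothetical, the density estimator assumes that $\text{NN}_{k}(x)$ excludes $x$ itself (the text does not specify this), and the toy example uses plain distances on a line as a stand-in for a precomputed diffusion distance matrix.

```python
import numpy as np

def knn_density(X, k, sigma0):
    """Kernel density estimate p(x) summed over the k nearest neighbors of x.

    Assumes NN_k(x) does not contain x itself.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nn_sq = np.sort(sq, axis=1)[:, 1:k + 1]   # drop the zero self-distance
    return np.exp(-nn_sq / sigma0**2).sum(axis=1)

def mode_scores(D, p):
    """Return (rho, score), where score = p * rho is the mode criterion."""
    n = len(p)
    top = np.argmax(p)                 # the global density maximizer
    rho = np.empty(n)
    for i in range(n):
        if i == top:
            rho[i] = D[i].max()        # maximum distance to any other point
        else:
            higher = p >= p[i]         # points of at least the same density
            higher[i] = False          # exclude the point itself
            rho[i] = D[i, higher].min()
    return rho, p * rho

# Toy stand-in: four points on a line, with hand-picked densities.
pos = np.array([0.0, 1.0, 5.0, 6.0])
D = np.abs(pos[:, None] - pos[None, :])
p = np.array([3.0, 2.0, 4.0, 1.0])
rho, score = mode_scores(D, p)
print(rho, score)
```

In this toy example the two largest scores belong to the points at positions 5 and 0, which are the densest points of the two visible groups.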

3.1. Learning by Unsupervised Nonlinear Diffusion

In [27], the maximizers of $\mathcal{D}_{t}$ were proposed as cluster modes, and diffusion distances and density were used to label all other points relative to these modes. We summarize this unsupervised learning algorithm, called learning by unsupervised nonlinear diffusion (LUND), in Algorithm 1. This algorithm was proven to perfectly cluster certain data, for an appropriate choice of time parameter $t$, and is robust to non-spherical data geometries and cluster overlap.

It was shown that, depending on the well-connectedness of the clusters compared to their separations, the range of $t$ for which Algorithm 1 performs well may be large [27]. However, developing methods for estimating an appropriate choice of $t$ without using any labeled data is an important and only partially addressed problem. Indeed, if the data admits hierarchical cluster structure, then several choices of $t$ may be appropriate, leading to different reasonable clusterings. In this context, querying a small number of points for labels can disambiguate between these different clusterings.

3.2. Learning by Active Nonlinear Diffusion

In the active learning setting, we characterize potential classes as being composed of $D_{t}$ -orbits around the maximizers of $\mathcal{D}_{t}$ . These orbits partition the data, and are comparable to elements of a Voronoi tessellation [30]. In the case that the labels for the data are smooth with respect to this partition, querying the maximizers of $\mathcal{D}_{t}$ is a more efficient use of a sampling budget than uniform random sampling. The proposed algorithm, denoted Learning by Active Nonlinear Diffusion (LAND) appears in Algorithm 2.

3.2.1. Analysis of LAND

From a theoretical standpoint, it is of interest to know when querying a small number of points (Algorithm 2) offers substantial benefit compared to unsupervised learning (Algorithm 1). Suppose that the underlying data consists of distinct classes $X=\bigcup_{k=1}^{K}X_{k}$ , with all points in $X_{k}$ having label $k$ . Let

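$$D_{t}^{\text{in}}=\max_{1\leq k\leq K}\ \max_{x,y\in X_{k}}D_{t}(x,y),\qquad D_{t}^{\text{btw}}=\min_{k\neq k'}\ \min_{x\in X_{k},\,y\in X_{k'}}D_{t}(x,y)$$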

be the maximum within-class and minimum between-class diffusion distances at time $t$ , respectively. Let

$$\mathcal{M}=\{x\in X\mid\exists k\text{ such that }x=\arg\max_{y\in X_{k}}p(y)\},$$

$\max(\mathcal{M})=\max_{x\in\mathcal{M}}p(x),\min(\mathcal{M})=\min_{x\in\mathcal{M}}p(x)$ be the density maximizers of the distinct classes, the maximum density among such classwise maximizers, and the minimum density among the classwise maximizers, respectively. In [27], it is shown that if

$$D_{t}^{\text{in}}/D_{t}^{\text{btw}}<\max(\mathcal{M})/\min(\mathcal{M}),\tag{2}$$

then the data can be labeled in a fully unsupervised manner by Algorithm 1. However, the underlying density conditions may not be satisfied in practice, particularly if the densities of the densest points of the different clusters differ strongly. Moreover, (2) depends strongly on $t$ . Introducing the active learning scheme makes it possible to bypass this potentially stringent density condition and still achieve perfect accuracy, at the cost of querying the labels of a small number of points.

Theorem 3.1.

Let $X=\bigcup_{k=1}^{K}X_{k}$ be data to classify. Suppose that $D_{t}^{\text{in}}<D_{t}^{\text{btw}}$ , and that the $B$ maximizers of $\mathcal{D}_{t}$ include the elements of $\mathcal{M}$ . Then LAND with a budget of size $B$ achieves perfect classification accuracy.

Proof.

If the $B$ maximizers of $\mathcal{D}_{t}$ include all the density maximizers of the distinct classes, that is, the elements of $\mathcal{M}$ , then the LAND queries guarantee these points are all labeled correctly. Then the result follows by induction on the data points sorted in order of decreasing $p(x)$ value. Indeed, for an unlabeled point $x\in X_{k}$ , its nearest diffusion neighbor of higher density, $x^{*}$ , must be in the same class $X_{k}$ , since $D_{t}^{\text{in}}<D_{t}^{\text{btw}}$ . Moreover, that point is already labeled as $Y(x^{*})=k$ , since $p(x^{*})\geq p(x)$ . Hence, $Y(x)=k$ .

∎

Theorem 3.1 asserts that the LAND algorithm achieves perfect accuracy as long as $D_{t}^{\text{in}}<D_{t}^{\text{btw}}$ and $B$ is large enough so that all elements of $\mathcal{M}$ are among the $B$ maximizers of $\mathcal{D}_{t}$ . Compared to LUND, LAND does not require that $D_{t}^{\text{in}}/D_{t}^{\text{btw}}<\max(\mathcal{M})/\min(\mathcal{M})$ to guarantee strong performance. This is an important point in practice, since the density between different regions of the data may vary considerably. Ultimately, active learning is most useful when the budget $B$ may be taken very small compared to $n$ ; we show in Section 4 that even a budget of just a few points may significantly improve accuracy on synthetic and real datasets.
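The query-then-propagate procedure analyzed in Theorem 3.1 can be sketched as follows. This is a reconstruction from the prose description and the proof, not a transcription of Algorithm 2: the function name `land`, the `oracle` callback, the use of `-1` for unlabeled points (so labels are assumed to be nonnegative integers), the requirement that `budget` be at least 1, and the fallback used when no labeled point of higher density exists are all assumptions made here. The toy distance matrix again stands in for diffusion distances.

```python
import numpy as np

def land(D, p, budget, oracle):
    """Query the `budget` maximizers of p * rho, then propagate labels."""
    n = len(p)
    top = np.argmax(p)
    rho = np.empty(n)
    for i in range(n):
        if i == top:
            rho[i] = D[i].max()
        else:
            higher = p >= p[i]
            higher[i] = False
            rho[i] = D[i, higher].min()
    queries = np.argsort(-(p * rho))[:budget]   # points with the largest scores
    labels = np.full(n, -1)                     # -1 marks "not yet labeled"
    for q in queries:
        labels[q] = oracle(q)                   # spend one unit of budget
    # Visit points by decreasing density; copy the label of the nearest
    # already-labeled point whose density is at least as high.
    for i in np.argsort(-p):
        if labels[i] != -1:
            continue
        cand = np.flatnonzero((labels != -1) & (p >= p[i]))
        if cand.size == 0:                      # assumed fallback, not from the text
            cand = np.flatnonzero(labels != -1)
        labels[i] = labels[cand[np.argmin(D[i, cand])]]
    return labels, queries

# Two well-separated groups on a line, so that D_in < D_btw.
pos = np.array([0.0, 0.5, 1.0, 10.0, 10.5, 11.0])
D = np.abs(pos[:, None] - pos[None, :])
p = np.array([1.0, 3.0, 2.0, 2.5, 5.0, 1.5])
truth = np.array([0, 0, 0, 1, 1, 1])
labels, queries = land(D, p, budget=2, oracle=lambda i: truth[i])
print(labels, queries)
```

In this example the two queried points are the classwise density maximizers, which is exactly the hypothesis of Theorem 3.1, and every remaining point inherits the correct label.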

3.3. Comparison with Related Methods

It is natural to compare LAND with related cluster-based active learning methods, as well as its unsupervised variant LUND.

3.3.1. Comparisons with Related Active Learning Methods

As discussed in Section 2.1, active learning methods may be categorized as falling into two broad classes: those based on refining the hypothesis space of classifiers, and those based on exploiting cluster structure in the data. LAND falls into the second category; it is thus natural to compare it with existing cluster-based active learning algorithms.

Many active learning algorithms that exploit cluster structure in the data proceed by constructing a hierarchical clustering on the data, often represented in the form of a dendrogram [31]. Given such a structure, sample queries are made in order to explore heterogeneous regions of the tree (leaves with highly mixed labels) and to avoid sampling from homogeneous regions of the data (leaves that consist mostly of a single class). The key challenge is to balance the cost of exploring ambiguous regions of the data with establishing the homogeneity of other regions.

Efficient algorithms that are statistically consistent have been proposed [23] and analyzed using the notion of “probabilistic Lipschitzness,” which quantifies purity of leaves of the hierarchical clustering [24]. These approaches make analyzing the hierarchical tree the central problem; the question of whether a particular method for constructing a hierarchical tree is appropriate is not directly considered. Indeed, it is common to construct the underlying hierarchical tree with standard methods, for example average-linkage clustering [23] or single linkage clustering [31]. Despite their pervasiveness, these methods for constructing hierarchical trees suffer from a lack of robustness to pernicious chains in the data (single-linkage) and geometric distortion (average-linkage). Active learning based on hierarchical trees performs well when the leaves of the tree become pure quickly when descending from the root node; if the underlying tree does not exhibit pure leaves until relatively deep in the tree, many samples are required for active learning, and the method may not improve substantially over random sampling.

Unlike average linkage and single linkage clustering, the proposed LAND method explicitly incorporates the underlying geometry of the data to construct clusters of multiscale granularity, which can then be exploited for active querying. The LAND algorithm may be interpreted as a method for constructing the underlying hierarchical tree, which has the desirable property that the leaves are essentially robust to geometric transformations of the underlying clusters, i.e. to making the clusters elongated or nonlinear. Indeed, given a number of clusters $K$ , one can run a variant of LUND in which $K$ is input as a parameter; see Algorithm 3.

It is then natural to compare the purity of the nodes of a hierarchical tree at scale $K$ , with the purity of the clusters learned by Algorithm 3 with number of clusters equal to $K$ . More generally, let $\mathcal{C}=\{C_{k}\}_{k=1}^{K}$ be a clustering of labeled data $\{(x_{i},y_{i})\}_{i=1}^{n}$ . Let $\bar{y}_{k}$ be the most common label among the points in $C_{k}$ . The purity of the clustering $\mathcal{C}$ is defined as

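$$\mathcal{P}(\mathcal{C})=\frac{1}{n}\sum_{k=1}^{K}\left|\{x_{i}\in C_{k}\mid y_{i}=\bar{y}_{k}\}\right|.$$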

Given a hierarchical clustering $\{\mathcal{C}_{\ell}\}_{\ell=1}^{n}$ (that is, $\mathcal{C}_{1}$ consists of 1 cluster with all points, $\mathcal{C}_{n}$ consists of $n$ singleton clusters, and $\mathcal{C}_{\ell+1}$ is the same clustering as $\mathcal{C}_{\ell}$ , but with one of the clusters split in two), the purity of the clustering at the $\ell^{th}$ scale is $\mathcal{P}(\mathcal{C}_{\ell})$ . Clearly $\mathcal{P}(\mathcal{C}_{\ell})$ is non-decreasing as a function of $\ell$ , and $\mathcal{P}(\mathcal{C}_{n})=1.$ If the growth of $\mathcal{P}(\mathcal{C}_{\ell})$ towards 1 is rapid in $\ell$ , then an active sampler does not need to search deeply into the tree to find regions with homogeneous labels. In Figure 3, a plot of $\mathcal{P}(\mathcal{C}_{\ell})$ is shown for three synthetic datasets with three different families of clusterings: single linkage clusters, average linkage clusters, and the clusters learned by Algorithm 3.
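The purity of a clustering is simple to compute, and a tiny example shows how it grows as a hierarchy is refined. The function name `purity` and the three hand-made clusterings below are illustrative assumptions; they are not the datasets of Figure 3.

```python
import numpy as np

def purity(clusters, y):
    """Fraction of points whose label equals the majority label of their cluster."""
    total = 0
    for c in np.unique(clusters):
        # Count each label inside cluster c and keep the most common one.
        _, counts = np.unique(y[clusters == c], return_counts=True)
        total += counts.max()
    return total / len(y)

y = np.array([0, 0, 1, 1, 1, 2])
coarse = np.zeros(6, dtype=int)             # one cluster containing all points
middle = np.array([0, 0, 0, 1, 1, 1])       # two clusters
fine = np.arange(6)                         # n singleton clusters
print(purity(coarse, y), purity(middle, y), purity(fine, y))
```

The three values increase from the coarsest to the finest clustering, and the singleton clustering always has purity 1.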

We see that for the geometric data, the clusters learned from average linkage clustering achieve high purity much later than the clusterings learned with single linkage clustering. This is due to the inability of average linkage to account for the nonlinear and elongated shapes of these clusters. Indeed, the opposite ends of the elongated cluster are quite far apart when measured with the average linkage metric, but are much closer when diffusion distances are used. On the other hand, the bottleneck and Gaussian data illustrate how single linkage clusters may take a long time to achieve high purity, due to the fact that single linkage clustering is guided only by density, and is not robust to adversarial paths of points connecting two otherwise well-separated clusters. Compared to single linkage and average linkage clustering, the clusters learned by LUND are robust to geometric distortions, adversarial paths, and noise.

3.3.2. Comparison with LUND

The proposed LAND algorithm (Algorithm 2) integrates an active learning criterion into the LUND algorithm (Algorithm 1). It has been shown that when the classes of the data $X$ are sufficiently coherent and pairwise well-separated, LUND with a good choice of $t$ perfectly labels all data points [27]. The unsupervised LUND algorithm depends critically on $t$ , and the robustness of LUND to this choice of parameter suggests its usefulness. However, developing practical methods for estimating a good choice of $t$ may be challenging in data that admits hierarchical cluster structure. Consider the data in Figure 4. For this data, it is ambiguous whether there are two or four clusters. Indeed, as shown in Figure 4 (c), if $\log_{10}(t)\in[1,3]$ , LUND estimates there are 4 clusters. If the time parameter satisfies $\log_{10}(t)\in[4,6]$ , LUND estimates there are 2 clusters. This is a fundamental ambiguity in unsupervised clustering, and one can view the ability of hierarchical clustering algorithms, and of LUND (depending on the time scale $t$ ), to detect the different possibilities for the number of clusters as a strength. Partial supervision allows for disambiguation in these situations.

Indeed, with a very small number (4) of labeled queries, LAND is able to overcome this obstacle and determine the labels of the data. This is because even for large $t$ , the top four values of $\mathcal{D}_{t}$ correspond to the modes of the four Gaussian clusters, and the diffusion distances within these clusters are quite small. In the unsupervised case, for large $t$ , the gap between the within-cluster and between-cluster distances for the four clusters is dwarfed by the gap between the within-cluster and between-cluster distances for the two clusters, leading to ambiguity. That is, when the underlying data is grouped into 2 clusters, $D_{t}^{\text{in}}/D_{t}^{\text{btw}}$ is large for large $t$ and small for small $t$ ; when the underlying data is grouped into 4 clusters, $D_{t}^{\text{in}}/D_{t}^{\text{btw}}$ is large for small $t$ and small for large $t$ . These lead to inherent ambiguity in how to choose $t$ in a fully unsupervised manner. However, by bringing in just 4 labels, LAND is able to correctly label the dataset for both large and small $t$ values, as can be seen from Figure 6. In this sense, LAND introduces robustness to the time parameter that may be problematic in LUND, at the cost of querying a small number of points.

3.4. Computational Complexity and Implementation

The proposed Algorithm 2 has computational complexity depending crucially on the number of data points to label ( $n$ ), the ambient dimensionality of the data ( $D$ ), and the intrinsic dimensionality of the data ( $d$ ).

Theorem 3.2.

Let $\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$ be data to label. Suppose all except for $O(\log(n))$ points have a higher density point within their $O(\log(n))$ $D_{t}$ -nearest neighbors. In the case that a $k_{\text{NN}}$ -sparse matrix $P$ is used, the LAND algorithm has complexity $O(C_{\text{NN}}+nk_{\text{NN}}+n\log(n))$ , where $C_{\text{NN}}$ is the cost of computing all $k_{\text{NN}}$ nearest neighbors.

Proof.

The construction of the Markov transition matrix $P$ has complexity $O(C_{\text{NN}})$ . The subsequent kernel density estimation for all points is then $O(nk_{\text{NN}})$ . The computation of $\rho_{t}$ for all points is $O(n\log(n))$ , where we assume that all except for $O(\log(n))$ points have a higher density point within their $O(\log(n))$ $D_{t}$ -nearest neighbors. Estimating the modes from $\mathcal{D}_{t}$ requires sorting $n$ values, and so has complexity $O(n\log(n))$ . Once the modes are estimated, labeling all points has complexity $O(n\log(n))$ , again by the assumption that all except for $O(\log(n))$ points have a higher density point within their $O(\log(n))$ $D_{t}$ -nearest neighbors. The result follows. ∎
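The first two terms of the bound can be made concrete with a small sketch of a $k_{\text{NN}}$ -sparse transition matrix. This is an illustration under stated assumptions rather than the authors' code: the neighbor search below is brute force, so it stands in for the worst-case $C_{\text{NN}}$ and would be replaced by a fast index in practice; the Gaussian weights, the density estimate as a sum of kernel values over the $k_{\text{NN}}$ neighbors, and the names `knn_markov` and `apply_P` are hypothetical, and no symmetrization of the neighbor graph is performed.

```python
import numpy as np

def knn_markov(X, k, sigma):
    """Row-stochastic kNN-sparse P stored as (neighbor indices, weights)."""
    # Brute-force squared distances: O(n^2 D), the worst-case neighbor cost.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Indices of the k smallest distances per row (each point is its own neighbor).
    idx = np.argpartition(sq, k, axis=1)[:, :k]
    W = np.exp(-np.take_along_axis(sq, idx, axis=1) / sigma**2)
    # Kernel density estimate over the k neighbors: O(n k) work.
    density = W.sum(axis=1)
    P = W / density[:, None]
    return idx, P, density / density.sum()

def apply_P(idx, P, f):
    """One application of the sparse P to a vector f: O(n k) work."""
    return np.sum(P * f[idx], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
idx, P, p = knn_markov(X, k=5, sigma=1.0)
```

Storing only $k_{\text{NN}}$ entries per row is what keeps each pass over $P$ at $O(nk_{\text{NN}})$ operations instead of $O(n^{2})$ .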

In the worst case, $C_{\text{NN}}=n^{2}$ , so that LAND has quadratic complexity in $n$ . When the data has intrinsically low-dimensional structure, fast nearest neighbor searches reduce this complexity to be quasilinear in $n$ .

Corollary 3.1.

Let $\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$ be data to label. When the underlying data has intrinsically $d$ -dimensional structure (in the sense of doubling dimension) and when $k_{\text{NN}}\ll\log(n)$ , LAND has computational complexity $O(DC^{d}n\log(n)^{2}).$

Proof.

In the case that the data has intrinsically low-dimensional structure in the sense of doubling dimension, the cover tree indexing structure [32] may be used, so that computing the $k_{\text{NN}}$ nearest neighbors of every point has complexity $O(DC^{d}k_{\text{NN}}n\log(n))$ . The result follows. ∎
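Spelled out, the final step amounts to substituting the cover tree cost into the bound of Theorem 3.2. The worked lines below assume that the condition on $k_{\text{NN}}$ is used in the form $k_{\text{NN}}=O(\log(n))$ and that $DC^{d}\geq 1$ ; both are our reading and are not stated in the proof.

```latex
\begin{align*}
C_{\text{NN}} &= O\big(DC^{d}k_{\text{NN}}\,n\log(n)\big)
               = O\big(DC^{d}n\log(n)^{2}\big), \\
nk_{\text{NN}} &= O\big(n\log(n)\big), \\
C_{\text{NN}} + nk_{\text{NN}} + n\log(n) &= O\big(DC^{d}n\log(n)^{2}\big).
\end{align*}
```

The nearest neighbor search is therefore the dominant term, and the remaining steps of LAND contribute only $O(n\log(n))$ .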

Corollary 3.1 suggests that the proposed algorithm is appropriate for large numbers of data points $n$ in high dimension, provided that the intrinsic dimensionality of the data is small.

4. Experimental Analysis

We perform experiments on three representative synthetic datasets, as well as two real hyperspectral images (available at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). Comparisons are made between LAND and two related methods:

(1) LAND with random query points. This algorithm consists of Algorithm 2, but with random points selected for querying, rather than the maximizers of $\mathcal{D}_{t}$ . Comparison with LAND will suggest if the query points determined by diffusion geometry and density, as captured by $\mathcal{D}_{t}$ , are actually of significant value.

(2) Cluster-based active learning (CBAL). This algorithm [23] is implemented using a hierarchical tree constructed from average linkage clustering.
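The difference between the two query rules in comparison (1) can be sketched as follows. This is a hedged illustration only: it assumes $\mathcal{D}_{t}(x)=p(x)\rho_{t}(x)$ , where $p$ is a density estimate and $\rho_{t}(x)$ is the diffusion distance from $x$ to its nearest point of higher density (with the highest-density point assigned its largest distance to any point). That definition is our reading of the earlier sections and is not restated here. The function names are hypothetical, and in the toy example absolute differences on a line stand in for diffusion distances.

```python
import numpy as np

def rho(p, D):
    """Distance from each point to its nearest higher-density point."""
    out = np.empty(len(p))
    for i in range(len(p)):
        higher = p > p[i]
        # The global density maximizer has no higher-density point.
        out[i] = D[i, higher].min() if higher.any() else D[i].max()
    return out

def land_queries(p, D, budget):
    """Indices of the `budget` largest values of p * rho (assumed D_t score)."""
    score = p * rho(p, D)
    return np.argsort(-score)[:budget]

def random_queries(n, budget, rng):
    """Baseline: query points chosen uniformly at random without replacement."""
    return rng.choice(n, size=budget, replace=False)

# Toy example: two groups on a line, each with one high-density point.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
p = np.array([1.0, 3.0, 1.0, 1.0, 2.0, 1.0])
D = np.abs(x[:, None] - x[None, :])   # stand-in for diffusion distances

queries = land_queries(p, D, budget=2)
baseline = random_queries(len(x), 2, np.random.default_rng(0))
```

In this toy example the score-based rule selects one point from each group, whereas the random baseline carries no such guarantee.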

Three performance metrics are used to compare the active learning results. Overall accuracy (OA) is the ratio of correctly labeled pixels to the total number of pixels. Average accuracy (AA) averages the OA of each class, equalizing the significance of small and large classes. Cohen’s $\kappa$ -statistic ( $\kappa$ ) is a measure of agreement between two labelings that is robust to random chance [33].
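For concreteness, the three metrics can be computed from a confusion matrix as in the sketch below. It uses the standard definitions (AA as the mean of per-class accuracies, $\kappa$ as observed agreement corrected by expected chance agreement) and assumes integer labels in $\{0,\dots,K-1\}$ with every class present in the ground truth; the function names are hypothetical.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """C[i, j] counts points of true class i assigned to class j."""
    C = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(C, (y_true, y_pred), 1)
    return C

def overall_accuracy(C):
    return np.trace(C) / C.sum()

def average_accuracy(C):
    # Mean of per-class accuracies; each class counts equally.
    return np.mean(np.diag(C) / C.sum(axis=1))

def cohen_kappa(C):
    n = C.sum()
    p_obs = np.trace(C) / n                                  # observed agreement
    p_exp = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n**2     # chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])
C = confusion(y_true, y_pred, n_classes=2)
oa, aa, kappa = overall_accuracy(C), average_accuracy(C), cohen_kappa(C)
```

On this small example OA is $5/6$ , AA is $7/8$ , and $\kappa$ is $2/3$ , which shows how AA gives more weight to the smaller class than OA does.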

4.1. Experiments on Synthetic Data

Experimental results on the three synthetic datasets introduced in Figure 3 are shown in Figure 7, illustrating the efficacy of LAND. In all cases, LAND achieves near perfect accuracy with fewer than 10 labels, while the comparison methods converge to high accuracy much more gradually.

4.2. Experiments on Hyperspectral Data

In order to illustrate the efficacy of LAND on real data, we demonstrate its performance on hyperspectral imagery, which constitutes an important data type in the remote sensing of the environment [34]. An HSI is an image consisting of $D$ spectral bands, each localized to a narrow electromagnetic range. The concatenation of these $D$ spectral bands provides highly detailed information about the materials being imaged, and can allow for precise discrimination of specific objects in the scene. While nominally a 3-dimensional tensor, an HSI is often analyzed by collapsing the spatial coordinates to produce a dataset $\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$ , where $n$ is the total number of pixels in the image and $D$ is the total number of spectral bands. When large training sets of labeled pixels are available, classification of an HSI scene may be effectively performed using a range of techniques, including support vector machines [35], deep learning [36], and random forests [37].
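The collapsing of spatial coordinates described above can be sketched in a few lines. The example assumes the HSI is stored as an array of shape (rows, columns, $D$ ) with the spectral bands last; the array sizes, the random data, and the function names are hypothetical.

```python
import numpy as np

def hsi_to_points(cube):
    """Collapse a (rows, cols, D) cube into an (n, D) array, n = rows * cols."""
    return cube.reshape(-1, cube.shape[-1])

def labels_to_image(labels, rows, cols):
    """Map a length-n label vector back onto the spatial grid."""
    return np.asarray(labels).reshape(rows, cols)

rng = np.random.default_rng(0)
rows, cols, bands = 4, 5, 3
cube = rng.random((rows, cols, bands))   # stand-in for a real HSI

X = hsi_to_points(cube)                  # one spectrum per pixel
label_image = labels_to_image(np.zeros(rows * cols, dtype=int), rows, cols)
```

With this row-major layout, pixel $(r,c)$ of the image corresponds to row $r\cdot\text{cols}+c$ of the data matrix, so labels computed on the point cloud can be displayed as an image again.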

Traditional supervised learning has led to strong empirical performance for HSI classification. However, supervised learning for HSI—particularly state-of-the-art deep learning—is predicated on the availability of large labeled training sets, which must be collected and annotated, typically by human experts. The need for large training sets is exacerbated by the high dimensionality of the data. The collection of large training sets may not be practical in the context of HSI, where there is a huge number of possible classes and large variabilities are introduced by sensing conditions. Indeed, the task of generating huge training sets for general HSI imagery is quite onerous, and may even require the deployment of humans to observe physically the scene that has been remotely sensed, which is very resource intensive. It is thus crucial to develop methods that can label HSI with no labeled training data [38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48] or a combination of labeled and unlabeled data [49, 50, 51].

Active learning for HSI is an important approach for achieving high-accuracy classification results without requiring large labeled training sets [52, 53, 54, 55, 56, 57]. These methods typically query the labels of points near the boundaries of classes, thus improving the convergence of the learning algorithm towards a good classifier. LAND, on the other hand, exploits cluster structure in the data.

4.2.1. Experimental Results for HSI

We perform active learning experiments on two real HSI datasets, shown in Figures 8 and 9, respectively.

Experimental results for the three methods on the Salinas A and Pavia datasets are shown in Figure 10. For the Salinas A dataset, accuracy with LAND is strong, with only 10 labels leading to highly accurate empirical results, and subsequent labels leading to rapid improvement towards perfect accuracy. In particular, compared to using random query labels or CBAL, the improvement of LAND as a function of the number of queries is rapid. For the Pavia dataset, there is a similar early jump in accuracy for LAND, while the improvement is slower for the comparison methods.

5. Conclusions and Future Work

The LAND algorithm integrates diffusion geometry and density estimation to efficiently estimate query points that are highly impactful on overall labeling accuracy in the active learning setting. Our theoretical and empirical analyses show LAND’s robustness to geometric distortions of the underlying data classes, and our experiments on real-world HSI demonstrate its effectiveness in accurately labeling high-dimensional datasets with a very small number of query points.

In the context of HSI, developing active learning methods that incorporate spatial proximity into the underlying diffusion process is of interest. This information may suggest that it is useful to query information in a spatially homogeneous region, where it can be most impactful. The integration of spatial information into a variant of the LUND algorithm adapted for HSI has proven effective [57, 48], and it is likely that such information would similarly boost the effectiveness of LAND.

It is of interest to develop a cross-validation scheme that exploits the active learning queries in order to iteratively update the optimal choice of time parameter $t$ . Indeed, as argued in Section 3.3.2, a very small number (essentially $O(K)$ ) of active learning queries can be used to achieve robustness to the parameter $t$ , which is critically important in the LUND algorithm. However, it may be possible to update the time parameter in an iterative fashion, by selecting at each time step a time scale that separates all the modes learned so far, before querying a new point. This has the potential to require fewer queries to learn all the classes, since the parameter is being adaptively optimized at each time step, rather than after all queries have been made.

Bibliography

[1] A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
[2] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[3] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[4] M.A. Tanner and W.H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.
[5] D.A. Van Dyk and X.-L. Meng. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):1–50, 2001.
[6] E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory, 57(3):1692–1706, 2011.
[7] E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local PCA. Journal of Machine Learning Research, 18(9):1–57, 2017.
[8] G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral clustering. The Annals of Statistics, 43(2):819–846, 2015.
