A Sampling Theory Perspective of Graph-based Semi-supervised Learning

Aamir Anis; Aly El Gamal; Salman Avestimehr; Antonio Ortega

arXiv:1705.09518·cs.LG·April 16, 2019

TL;DR

This paper offers a theoretical framework for understanding graph-based semi-supervised learning by modeling class indicators as bandlimited signals and analyzing their bandwidth in relation to dataset geometry.

Contribution

It introduces a sampling theory perspective, justifying the bandlimitedness assumption of class indicators in semi-supervised learning.

Findings

01. Bandwidth of class indicators relates to dataset geometry
02. The approach applies to general data models with separable and nonseparable classes
03. Provides a theoretical basis for graph-based semi-supervised classification

Tables

Table I. Related convergence results in the literature under different data models and graph construction schemes. All models assume that the distributions are smooth (at least twice-differentiable). Further, the graph Laplacian is defined as ${\bf L}=\frac{1}{n}({\bf D}-{\bf W})$ in all cases. [24] also studies convergence of graph cuts for weighted $k$-nearest neighbor and $r$-neighborhood graphs, which we do not include for brevity.

Work	Data model	Graph model	Quantity	Convergence regime	Limit (within constant scaling factor)
Narayanan et al. [3]	$p({\bf x})$ supported on manifold $\mathcal{M} \subset \mathbb{R}^{d}$, separated into $S$ and $S^{c}$ by smooth hypersurface $\partial S$	Normalized Gaussian weights $w'_{ij} = \frac{w_{ij}}{\sqrt{d_{i} d_{j}}}$	$\frac{1}{n\sigma} {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S}$	$n \to \infty$, $\sigma \to 0$	$\int_{\partial S} p({\bf s}) \, d{\bf s}$
Maier et al. [24]	$p({\bf x})$ supported on $\mathcal{M} \subset \mathbb{R}^{d}$, separated into $S$ and $S^{c}$ by hyperplane $\partial S$	$r$-neighborhood, unweighted	$\frac{1}{n r^{d+1}} {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S}$	$n \to \infty$, $r \to 0$	$\int_{\partial S} p^{2}({\bf s}) \, d{\bf s}$
		$k$-nn, unweighted, $t = (k/n)^{1/d}$	$\frac{1}{n t^{d+1}} {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S}$	$n \to \infty$, $t \to 0$	$\int_{\partial S} p^{1 - 1/d}({\bf s}) \, d{\bf s}$
		fully-connected, Gaussian weights	$\frac{1}{n\sigma} {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S}$	$n \to \infty$, $\sigma \to 0$	$\int_{\partial S} p^{2}({\bf s}) \, d{\bf s}$
Bousquet et al. [17], Hein [18]	$p({\bf x})$ and $f({\bf x})$ supported on $\mathbb{R}^{d}$	fully-connected, weights $w_{ij} = \frac{1}{n\sigma^{d}} K\left(\frac{\|{\bf X}_{i} - {\bf X}_{j}\|^{2}}{\sigma^{2}}\right)$, where $K(\cdot)$ is a smooth decaying kernel	$\frac{1}{n\sigma^{2}} {\bf f}^{T} {\bf L} {\bf f}$	$n \to \infty$, $\sigma \to 0$	$\int \|\nabla f({\bf x})\|^{2} p^{2}({\bf x}) \, d{\bf x}$
Zhou et al. [6]	Uniformly distributed on $d$-dim. submanifold $\mathcal{M}$	fully-connected, Gaussian weights	$\frac{1}{n\sigma^{m}} {\bf f}^{T} {\bf L}^{m} {\bf f}$	$n \to \infty$, $\sigma \to 0$	$\int f({\bf x}) \Delta^{m} f({\bf x}) \, d{\bf x}$
García Trillos & Slepčev [5]	$p({\bf x})$ supported on $D \subset \mathbb{R}^{d}$	fully-connected, weights $w_{ij} = \frac{1}{\epsilon^{d}} \eta\left(\frac{\|{\bf X}_{i} - {\bf X}_{j}\|}{\epsilon}\right)$, where $\eta(\cdot)$ is a smoothly decaying kernel	$\frac{1}{n^{2}\epsilon} GTV({\bf f})$	$n \to \infty$, $\epsilon \to 0$	$\int \|\nabla f({\bf x})\| p^{2}({\bf x}) \, d{\bf x}$
El Alaoui et al. [8], Slepčev & Thorpe [22]	$p({\bf x})$ supported on $[0,1]^{d}$, $\Omega \subset \mathbb{R}^{d}$	fully-connected, weights $w_{ij} = \phi\left(\frac{\|{\bf X}_{i} - {\bf X}_{j}\|}{h}\right)$, where $\phi(\cdot)$ is a smoothly decaying kernel	$\frac{1}{n^{2} h^{p+d}} J_{p}({\bf f})$	$n \to \infty$, $h \to 0$	$\int \|\nabla f({\bf x})\|^{p} p^{2}({\bf x}) \, d{\bf x}$
This work	$p({\bf x})$ supported on $\mathbb{R}^{d}$, separated into $S$ and $S^{c}$ by smooth hypersurface $\partial S$	fully-connected, Gaussian weights	$\frac{1}{\sigma^{1/m}} \omega_{m}({\bf 1}_{S})$	$n \to \infty$, $\sigma \to 0$, $m \to \infty$	$\sup_{{\bf s} \in \partial S} p({\bf s})$
	Drawn from $p_{A}({\bf x})$ and $p_{A^{c}}({\bf x})$ supported on $\mathbb{R}^{d}$ with probabilities $\alpha_{A}$ and $\alpha_{A^{c}}$	fully-connected, Gaussian weights	$\omega_{m}({\bf 1}_{A})$	$n \to \infty$, $\sigma \to 0$, $m \to \infty$	$\sup_{{\bf x} \in \partial A} p({\bf x})$

Table II. Illustrative boundaries used in the separable model.

Boundary	Description	$\sup_{{\bf s} \in \partial S} p({\bf s})$
$\partial S_{1}$	$x = 0$	0.0607
$\partial S_{2}$	$x = -1$	0.2547
$\partial S_{3}$	$x = y^{2} - 1$	0.2547
$\partial S_{4}$	$y = 0$	0.5969
$\partial S_{5}$	$x^{2} + y^{2} = 1$	0.5969

Equations

$$p({\bf x}) = \alpha_{A} p_{A}({\bf x}) + \alpha_{A^{c}} p_{A^{c}}({\bf x}).$$

$$\partial A := \{{\bf x} \in \mathbb{R}^{d} \mid p_{A}({\bf x}) \, p_{A^{c}}({\bf x}) > 0\}.$$

$$w_{ij} := K_{\sigma^{2}}({\bf X}_{i}, {\bf X}_{j}) = \frac{1}{(2\pi\sigma^{2})^{d/2}} e^{-\|{\bf X}_{i} - {\bf X}_{j}\|^{2}/2\sigma^{2}},$$

$$\omega({\bf f}) := \max_{i} \big\{ \lambda_{i} \;\big|\; |{\bf u}_{i}^{T}{\bf f}| > 0 \big\}.$$

$$\min_{\bf g} \|{\bf g}_{L} - {\bf f}_{L}\|^{2} \;\; \text{subject to} \;\; \omega({\bf g}) \leq \theta,$$

$${\bf f} = \sum_{i=1}^{\mathcal{N}_{\bf L}(\omega({\bf f}))} c_{i} {\bf u}_{i} = {\bf U}_{:,R} \, {\bf c},$$

$$\omega_{m}({\bf f}) := \left( \frac{{\bf f}^{T} {\bf L}^{m} {\bf f}}{{\bf f}^{T} {\bf f}} \right)^{1/m},$$

$$\forall \, {\bf f}, \quad \omega({\bf f}) = \lim_{m \to \infty} \omega_{m}({\bf f}).$$

$$\omega_{m}({\bf 1}_{S}) = \left( \frac{{\bf 1}_{S}^{T} {\bf L}^{m} {\bf 1}_{S}}{{\bf 1}_{S}^{T} {\bf 1}_{S}} \right)^{\frac{1}{m}} \quad \text{and} \quad \omega_{m}({\bf 1}_{A}) = \left( \frac{{\bf 1}_{A}^{T} {\bf L}^{m} {\bf 1}_{A}}{{\bf 1}_{A}^{T} {\bf 1}_{A}} \right)^{\frac{1}{m}}$$

$$Cut(S, S^{c}) := \sum_{{\bf X}_{i} \in S, \, {\bf X}_{j} \in S^{c}} w_{ij} = n \, {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S}.$$

$$\frac{1}{n\sigma} {\bf 1}_{S}^{T} {\bf L} {\bf 1}_{S} \xrightarrow{\text{p.}} \frac{1}{2\pi} \int_{\partial S} p^{2}({\bf s}) \, d{\bf s},$$

$$\min_{\bf f} {\bf f}^{T} {\bf L} {\bf f} \;\; \text{such that} \;\; {\bf f}(L) = {\bf y}(L),$$

$$\frac{1}{n\sigma^{2}} {\bf f}^{T} {\bf L} {\bf f} \xrightarrow{\text{p.}} C \int_{\mathbb{R}^{d}} \|\nabla f({\bf x})\|^{2} p^{2}({\bf x}) \, d{\bf x},$$

$$\frac{1}{n^{2}\epsilon} GTV({\bf f}) \xrightarrow{\Gamma} C \int_{D} \|\nabla f({\bf x})\| \, p^{2}({\bf x}) \, d{\bf x},$$

$$\frac{1}{n\sigma_{n}^{m}} {\bf f}^{T} {\bf L}^{m} {\bf f} \xrightarrow{\text{p.}} C \int_{\mathcal{M}} f({\bf x}) \Delta^{m} f({\bf x}) \, d{\bf x},$$

$$\frac{1}{n^{2} h^{p+d}} J_{p}({\bf f}) \xrightarrow{\text{p.}} C \int_{[0,1]^{d}} \|\nabla f({\bf x})\|^{p} p^{2}({\bf x}) \, d{\bf x}.$$

$$\frac{1}{\sigma^{1/m}} \omega_{m}({\bf 1}_{S}) \xrightarrow{\text{p.}} \sup_{{\bf s} \in \partial S} p({\bf s}),$$

$$\omega_{m}({\bf 1}_{A}) \xrightarrow{\text{p.}} \sup_{{\bf x} \in \partial A} p({\bf x}).$$

$$m$$

$$\sigma$$

$$Cut(A, A^{c}) := \sum_{{\bf X}_{i} \in A, \, {\bf X}_{j} \in A^{c}} w_{ij} = n \, {\bf 1}_{A}^{T} {\bf L} {\bf 1}_{A}.$$

$$\frac{1}{n} Cut(A, A^{c}) \xrightarrow{\text{p.}} \int \alpha_{A} \alpha_{A^{c}} \, p_{A}({\bf x}) \, p_{A^{c}}({\bf x}) \, d{\bf x}.$$

$$X_{S}$$

$$X_{A}$$

$$E\left\{ \frac{1}{n} \mathcal{N}_{\bf L}(t) \right\} \longrightarrow P(\{{\bf x} : p({\bf x}) \leq t\}).$$

$$\frac{1}{n} \mathcal{N}_{\bf L}(\omega({\bf 1}_{S})) \to P(X_{S}),$$

$$\frac{1}{n} \mathcal{N}_{\bf L}(\omega({\bf 1}_{A})) \to P(X_{A}).$$

$$(\omega_{m}({\bf 1}_{R}))^{m} = \frac{{\bf 1}_{R}^{T} {\bf L}^{m} {\bf 1}_{R}}{{\bf 1}_{R}^{T} {\bf 1}_{R}} = \frac{\frac{1}{n} {\bf 1}_{R}^{T} {\bf L}^{m} {\bf 1}_{R}}{\frac{1}{n} {\bf 1}_{R}^{T} {\bf 1}_{R}}.$$

$$\frac{1}{n} {\bf 1}_{R}^{T} {\bf 1}_{R} \xrightarrow{\text{a.s.}} \int_{{\bf x} \in R} p({\bf x}) \, d{\bf x},$$

$$V$$

$$= \frac{1}{n^{m+1}} {\bf 1}_{R}^{T} \left( \sum_{k=0}^{2^{m}-1} B_{k} \right) {\bf 1}_{R},$$

$$B_{k}$$

Full text

A Sampling Theory Perspective of Graph-based Semi-supervised Learning

Aamir Anis, Aly El Gamal, Salman Avestimehr, and Antonio Ortega

This work is supported in part by NSF under grants CCF-1410009, CCF-1527874, CCF-1408639, NETS-1419632 and by AFRL and DARPA under grant 108818. S. Avestimehr and A. Ortega are with the Ming Hsieh Department of Electrical Engineering, University of Southern California. A. Anis is currently with Google Inc.; he was affiliated with the University of Southern California at the time this work was completed. A. El Gamal is with the Department of Electrical and Computer Engineering, Purdue University. (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract

Graph-based methods have been quite successful in solving unsupervised and semi-supervised learning problems, as they provide a means to capture the underlying geometry of the dataset. It is often desirable for the constructed graph to satisfy two properties: first, data points that are similar in the feature space should be strongly connected on the graph, and second, the class label information should vary smoothly with respect to the graph, where smoothness is measured using the spectral properties of the graph Laplacian matrix. Recent works have justified some of these smoothness conditions by showing that they are strongly linked to the semi-supervised smoothness assumption and its variants. In this work, we reinforce this connection by viewing the problem from a graph sampling theoretic perspective, where class indicator functions are treated as bandlimited graph signals (in the eigenvector basis of the graph Laplacian) and label prediction as a bandlimited reconstruction problem. Our approach involves analyzing the bandwidth of class indicator signals generated from statistical data models with separable and nonseparable classes. These models are quite general and mimic the nature of most real-world datasets. Our results show that in the asymptotic limit, the bandwidth of any class indicator is also closely related to the geometry of the dataset. This allows one to theoretically justify the assumption of bandlimitedness of class indicator signals, thereby providing a sampling theoretic interpretation of graph-based semi-supervised classification.

I Introduction

The abundance of unlabeled data in various machine learning applications, along with the prohibitive cost of labeling, has led to growing interest in semi-supervised learning. This paradigm deals with the task of classifying data points in the presence of very little labeling information by relying on the geometry of the dataset. Assuming that the features are well-chosen, a natural assumption in this setting is to consider the marginal density $p({\bf x})$ of the feature vectors to be informative about the labeling function $f({\bf x})$ defined on the points. This assumption is fundamental to the semi-supervised learning problem both in the classification and the regression settings, and is also known as the semi-supervised smoothness assumption [1], which states that the label function is smoother in regions of high data density. There also exist other similar variants of this assumption specialized for the classification setting, namely, the cluster assumption [2] (points in a cluster are likely to have the same class label) or the low density separation assumption [3] (decision boundaries pass through regions of low data density). Most present day algorithms for semi-supervised learning rely on one or more of these assumptions to predict the unknown labels.

In practice, graph-based methods have been found to be quite suitable for geometry-based learning tasks, primarily because they provide an easy way of exploiting information from the geometry of the dataset. These methods involve constructing a distance-based similarity graph whose vertices (nodes) represent the data points and whose edge weights are in general a decreasing function of the distances between them. The learning task then involves predicting the labels of the unknown nodes, given the known labels, often called the transductive learning paradigm. The key assumption here is that the label function is “smooth” over the graph, in the sense that labels of vertices do not vary much over edges with high weights (i.e., edges that connect close or similar points). There are numerous ways of quantitatively imposing smoothness constraints over label functions defined on the vertices of a similarity graph. Most graph-based semi-supervised classification algorithms incorporate one of these criteria as a penalty against the fitting error in a regularization problem, or as a constraint term while minimizing the fitting error in an optimization problem. For example, a commonly used measure of smoothness for a label function ${\bf f}$ is the graph Laplacian regularizer ${\bf f}^{T}{\bf L}{\bf f}$ ( ${\bf L}$ being the graph Laplacian), and many algorithms involve minimizing this quadratic energy function while ensuring that ${\bf f}$ satisfies the known set of labels [4, 2]. Another example is the graph total variation [5]. There also exist higher-order variants of the smoothness measure such as iterated graph Laplacian regularizers ${\bf f}^{T}{\bf L}^{m}{\bf f}$ [6] and the $p$ -Laplacian regularizer [7, 8], that have been shown to make the problem more well-behaved. 
On the other hand, a spectral theory based classification algorithm restricts ${\bf f}$ to be spanned by the first few eigenvectors of the graph Laplacian [9, 10], that are known to form a representation basis for smooth functions on the graph. In each of the examples, the criterion enforces smoothness of the labels over the graph – a lower value of the regularizer ${\bf f}^{T}{\bf L}{\bf f}$ , and a smaller number of leading eigenvectors to model ${\bf f}$ imply that vertices that are close neighbors on the graph are more likely to have the same label.

A more recent approach, derived from Graph Signal Processing (GSP) [11], considers the semi-supervised learning problem from the perspective of sampling theory for graph signals [12, 13, 14, 15]. It involves treating the class label function ${\bf f}$ as a bandlimited graph signal, and label prediction as a bandlimited reconstruction problem. The advantage of this approach is that one can also analyze, using sampling theory, the label complexity of graph-based semi-supervised classification, that is, the fraction of labeled vertices on the graph required for predicting the labels of the unlabeled vertices. A key ingredient in this formulation is the bandwidth $\omega({\bf f})$ of signals on the graph, which is defined as the largest Laplacian eigenvalue for which the projection of the signal over the corresponding eigenvector is non-zero. Signals with lower bandwidth tend to be smoother on the graph and have a lower label complexity. Label prediction using bandlimited reconstruction then involves estimating a graph signal that minimizes prediction error on the labeled set under a bandwidth constraint. This can also be carried out without explicitly computing the eigenvectors of the Laplacian, and has been shown to be quite competitive in comparison to state-of-the-art graph-based semi-supervised learning methods [16].

Although graph-based semi-supervised learning methods are well-motivated, their connection to the underlying geometry of the dataset had not been clearly understood so far in a theoretical sense. Recent works focused on justifying these approaches by exploring their geometrical interpretation in the limit of infinitely available unlabeled data. This is typically done by assuming a probabilistic generative model for the dataset and analyzing the graph smoothness criteria in the asymptotic setting for certain commonly-used graph construction schemes. For example, it has been shown that for data points drawn from a smooth distribution with an associated smooth label function (i.e., the regression setting), the graph Laplacian-based regularizers converge in the limit of infinite data points to some density-weighted variational energy functional that penalizes large variations of the labels in high density regions [10, 17, 18, 19, 6, 20, 5, 21, 22]. A similar connection ensues for semi-supervised learning problems in the classification setting (i.e., when labels are discrete in the feature space). If points drawn from a smooth distribution are separated by a smooth boundary into two classes, then the graph cut for the partition converges to a weighted volume of the boundary [3, 23, 24, 25]. This is consistent with the low density separation assumption – a low value of the graph cut implies that the boundary passes through regions of low data density.

To our knowledge, no such connections have been drawn for the sampling theoretic approach to learning. A geometrical interpretation of this approach would help complete our theoretical understanding of graph-based semi-supervised learning approaches and strengthen their link with the semi-supervised smoothness assumption and its variants. Therefore, in this work, we seek answers for the following questions:

• What is the connection between the bandwidth of class indicator signals over the similarity graph and the underlying geometry of the data set?

• What is the interpretation of the bandlimited reconstruction approach for label prediction?

• How many labeled examples does one require for predicting the unknown labels?

To answer these questions, our work analyzes the asymptotic behavior of an iterated Laplacian-based bandwidth estimator for class indicator signals on similarity graphs constructed from a statistical model for the feature vectors. To make our analysis as general as possible, we consider two data models: separable and nonseparable. These generative models are quite practical and can be used to mimic most datasets in the real world. The separable model assumes that data points are independently drawn from an underlying probability distribution in the feature space and each class is separated from the others by a smooth boundary. On the other hand, the nonseparable model assumes a mixture distribution for the data where the data points are drawn independently with certain probability from separate class conditional distributions. We also introduce a notion of “boundaries” for classes in the nonseparable model in the form of overlap regions (i.e., the region of ambiguity), defined as the set of points where the probability of belonging and not belonging to a class are both non-zero. This definition is quite practical and useful for characterizing the geometry of such datasets.

Using the data points, we consider a specific graph construction scheme that applies the Gaussian kernel over Euclidean distances between feature vectors for computing their similarities (our analysis can be generalized easily to arbitrary kernels under simple assumptions). In order to compute the bandwidth of any signal on the graph, we define an estimator based on the iterated Laplacian regularizer. A significant portion of this paper focuses on analyzing the stochastic convergence of this bandwidth estimate (using variance-bias decomposition) in the limit of infinite data points for any class indicator signal on the graph. The analysis in our work suggests a novel sampling theoretic interpretation of graph-based semi-supervised learning and the main contributions can be summarized as follows:

• Relationship between bandwidth and data geometry. For the separable model, we show that under certain rate conditions, the bandwidth estimate for any class indicator signal over the graph converges to the supremum of the data density over the class boundary. Similarly, for the nonseparable model, we show that the bandwidth estimate converges to the supremum of the density over the overlap region. Based on these results, we conjecture, with supporting experiments, that the bandwidths also converge to the same values.

• Interpretation of bandlimited reconstruction. Using the geometrical interpretation of the bandwidth, we conclude that bandlimited reconstruction allows one to choose the complexity of the hypothesis space while predicting unknown labels (i.e., a larger bandwidth allows more complex class boundaries).

• Quantification of label complexity for sampling theory-based learning. For both the separable and nonseparable models, we conjecture, with supporting arguments and experiments, that the fraction of labeled nodes on the graph for reconstructing class indicator signals converges, in the asymptotic limit, to the probability mass of the sublevel set that entirely encompasses the boundary.
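As a rough illustration of the iterated-Laplacian bandwidth estimate referred to above, the page's equation list gives $\omega_{m}({\bf f}) := ({\bf f}^{T}{\bf L}^{m}{\bf f}/{\bf f}^{T}{\bf f})^{1/m}$ with $\omega({\bf f})=\lim_{m\to\infty}\omega_{m}({\bf f})$. The sketch below evaluates this quantity from the spectral coefficients of a signal on a toy path graph. The path graph, the helper names `path_laplacian` and `omega_m`, and the choice of coefficients are assumptions made for illustration only; the paper itself works with Gaussian-kernel graphs.

```python
import numpy as np

def path_laplacian(n):
    """Toy graph (assumption): unweighted path on n nodes, L = (D - W) / n."""
    W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return (np.diag(W.sum(axis=1)) - W) / n

def omega_m(eigvals, coeffs, m):
    """Evaluate (f^T L^m f / f^T f)^(1/m) from the spectral coefficients of f.

    With f = sum_i c_i u_i and orthonormal u_i, f^T L^m f = sum_i c_i^2 lambda_i^m.
    Eigenvalues are rescaled by the largest one in the support to avoid underflow.
    """
    energy = coeffs**2
    support = energy > 0
    lam = np.clip(eigvals[support], 0.0, None)  # guard against tiny negative round-off
    lam_top = lam.max()
    ratio = np.sum(energy[support] * (lam / lam_top) ** m) / np.sum(energy[support])
    return lam_top * ratio ** (1.0 / m)

lap = path_laplacian(30)
eigvals, eigvecs = np.linalg.eigh(lap)

# Hypothetical signal: equal weight on the five lowest graph frequencies.
coeffs = np.zeros(30)
coeffs[:5] = 1.0
omega = eigvals[4]  # exact bandwidth of this signal

estimates = [omega_m(eigvals, coeffs, m) for m in (1, 2, 4, 8, 16, 32, 64)]
for m, est in zip((1, 2, 4, 8, 16, 32, 64), estimates):
    print(m, est / omega)
```

In this toy setting the estimate never exceeds the true bandwidth and approaches it from below as $m$ grows, which is the behavior the limit above describes.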

Our analysis has significant implications: Firstly, class indicator signals have a low bandwidth if class boundaries lie in regions of low data densities, that is, the semi-supervised assumption holds for graph-based methods. And secondly, our analysis also helps quantify the impact of bandwidth and data geometry in semi-supervised learning problems. Specifically, it enables us to theoretically assert that for the sampling theoretic approach to graph-based semi-supervised learning, the label complexity of class indicator signals over the graph is indeed lower if the boundary lies in regions of low data density, as demonstrated empirically in earlier works [9, 10].

The rest of this paper is organized as follows: In Section II, we formally introduce the statistical data models and the graph construction scheme for analysis, along with a precursor of concepts from graph sampling theory. In Section III, we review prior work and underline their connections with our work. In Section IV, we state our main results and outline their implications. In Section V, we prove the major building blocks for our results. We finally conclude with numerical validation in Section VI, followed by discussion and an outline of future work in Section VII. It is worth noting that the bandwidth convergence result for the separable model and an interpretation of bandlimited reconstruction were given in our preliminary work [26]. This paper presents complete formal proofs for those results, extends them to the nonseparable model, and also analyzes label complexity.

II Preliminaries

II-A Data models

II-A1 The separable model

In this model, we assume that the dataset consists of a pool of $n$ random, $d$-dimensional feature vectors ${\cal X}=\{{\bf X}_{1},{\bf X}_{2},\dots,{\bf X}_{n}\}$ drawn independently from some probability density function $p({\bf x})$ supported on $\mathbb{R}^{d}$ (this is assumed for simplicity; the analysis can be extended to subsets $D\subset\mathbb{R}^{d}$ and low-dimensional manifolds $\mathcal{M}$ in $\mathbb{R}^{d}$ , but would be more technically involved). To simplify our analysis, we also assume that $p({\bf x})$ is bounded from above, Lipschitz continuous and twice differentiable. We assume that a smooth hypersurface $\partial S$ , with radius of curvature lower bounded by a constant $\tau$ , splits $\mathbb{R}^{d}$ into two disjoint classes $S$ and $S^{c}$ , with indicator functions $1_{S}({\bf x}):\mathbb{R}^{d}\rightarrow\{0,1\}$ and $1_{S^{c}}({\bf x}):\mathbb{R}^{d}\rightarrow\{0,1\}$ . This is illustrated in Figure 1(a). Thus, the $n$-dimensional class indicator signal for class $S$ is denoted by the bold-faced vector notation ${\bf 1}_{S}\in\{0,1\}^{n}$ , and defined as $({\bf 1}_{S})_{i}:=1_{S}({\bf X}_{i})$ , i.e., the $i^{\text{th}}$ entry of ${\bf 1}_{S}$ is $1$ if ${\bf X}_{i}\in S$ and $0$ otherwise.
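A minimal sketch of how a dataset from the separable model could be simulated is given below. The density (an equal-weight mixture of two unit-variance Gaussians), the boundary $x=0$, and the function name `sample_separable` are all assumptions chosen for illustration; they are not the settings used in the paper's experiments.

```python
import numpy as np

def sample_separable(n, rng):
    """Draw n points from an assumed 2-D density and label them by the side of x = 0."""
    # Assumed density p(x): equal-weight mixture of two isotropic Gaussians
    # centred at (-2, 0) and (+2, 0). Illustrative choice only.
    centres = np.array([[-2.0, 0.0], [2.0, 0.0]])
    component = rng.integers(0, 2, size=n)
    X = centres[component] + rng.standard_normal((n, 2))
    # Assumed boundary: the hyperplane x = 0, so S = {x : x_1 > 0}.
    # The label depends only on the position of the point, not on the component.
    one_S = (X[:, 0] > 0).astype(float)
    return X, one_S

rng = np.random.default_rng(0)
X, one_S = sample_separable(500, rng)
print(X.shape, one_S.mean())
```

The key property of the model is visible in the last step of the function: the indicator vector is obtained by sampling the continuous indicator function $1_{S}({\bf x})$ at the data points.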

II-A2 The nonseparable model

In this model, we assume that each class has its own conditional distribution supported on $\mathbb{R}^{d}$ (which may or may not overlap with the distributions of other classes). The data set consists of a pool of $n$ random $d$-dimensional feature vectors ${\cal X}=\{{\bf X}_{1},{\bf X}_{2},\dots,{\bf X}_{n}\}$ , each drawn independently from one of the distributions $p_{i}({\bf x})$ , selected with probability $\alpha_{i}$ , such that $\sum_{i}\alpha_{i}=1$ . For our analysis, we consider a class denoted by an index $A$ with selection probability $\alpha_{A}$ , class conditional distribution $p_{A}({\bf x})$ and an $n$-dimensional indicator vector ${\bf 1}_{A}$ whose $i^{\text{th}}$ component takes value $1$ if ${\bf X}_{i}$ is drawn from class $A$ and $0$ otherwise. Note that ${\bf 1}_{A}$ does not have a continuous domain counterpart, unlike ${\bf 1}_{S}$ which is sampled from the indicator function $1_{S}({\bf x})$ on points in ${\cal X}$ . We illustrate the nonseparable model in Figure 1(b). Further, we denote by $\alpha_{A^{c}}=1-\alpha_{A}$ the probability that a point does not belong to $A$ and by $p_{A^{c}}({\bf x})=\sum_{i\neq A}\alpha_{i}p_{i}({\bf x})/\alpha_{A^{c}}$ the density of all such points. The marginal distribution of data points is then given by the mixture density

$$p({\bf x})=\alpha_{A}\,p_{A}({\bf x})+\alpha_{A^{c}}\,p_{A^{c}}({\bf x}). \tag{1}$$

Once again, to simplify our analysis, we assume that all distributions are Lipschitz continuous, bounded from above and twice differentiable in $\mathbb{R}^{d}$ . Next, we introduce the notion of a “boundary” for classes in the nonseparable model as follows: for class $A$ , we define its overlap region $\partial A$ as

$$\partial A:=\{{\bf x}\in\mathbb{R}^{d}\mid p_{A}({\bf x})\,p_{A^{c}}({\bf x})>0\}. \tag{2}$$

Intuitively, $\partial A$ can be considered as the region of ambiguity, where both points belonging and not belonging to $A$ co-exist. In other words, $\partial A$ can be thought of as a “boundary” that separates the region where points can only belong to $A$ from the region where points can never belong to $A$ . Since class indicator signals on graphs will change values only within the overlap region, one would expect that the indicators will be smoother if there are fewer data points within this region. We shall show later that this is indeed the case, both theoretically and experimentally. Note that the definition of the boundary is not very meaningful for class conditional distributions with decaying tails, such as the Gaussian, since the boundary in this case technically encompasses the entire feature space. However, in such cases, one can approximate the boundary with appropriate thresholds in the definition and this approximation can also be formalized for distributions with exponentially decaying tails.
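The nonseparable model and its overlap region can be sketched in one dimension as follows. The triangular class-conditional densities, the value of $\alpha_{A}$, and the function name `sample_nonseparable` are assumptions for illustration. Triangular densities have compact support, so the overlap region is a well-defined interval, but they are not twice differentiable and therefore do not satisfy the smoothness assumptions stated above.

```python
import numpy as np

def sample_nonseparable(n, alpha_A, rng):
    """Draw n points from the mixture alpha_A * p_A + (1 - alpha_A) * p_Ac (1-D toy case)."""
    # Each point comes from class A with probability alpha_A.
    from_A = rng.random(n) < alpha_A
    # Assumed class-conditional densities: triangular, p_A on [0, 2] and p_Ac on [1, 3].
    x_A = rng.triangular(0.0, 1.0, 2.0, size=n)
    x_Ac = rng.triangular(1.0, 2.0, 3.0, size=n)
    X = np.where(from_A, x_A, x_Ac)
    # The indicator records which distribution generated the point, not where it landed.
    one_A = from_A.astype(float)
    return X, one_A

rng = np.random.default_rng(0)
X, one_A = sample_nonseparable(2000, alpha_A=0.5, rng=rng)

# Overlap region for these densities: both are positive exactly on (1, 2).
in_overlap = (X > 1.0) & (X < 2.0)
print("fraction of points in the overlap region:", in_overlap.mean())
print("labels seen inside the overlap region:", np.unique(one_A[in_overlap]))
```

Outside the interval $(1,2)$ the label is determined by position alone, while inside it both labels occur, which matches the description of $\partial A$ as a region of ambiguity.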

II-B Graph construction

Using the $n$ feature vectors, we construct an undirected distance-based similarity graph where nodes represent the data points and edge weights are proportional to their similarity, given by the Gaussian kernel:

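$$w_{ij}:=K_{\sigma^{2}}({\bf X}_{i},{\bf X}_{j})=\frac{1}{(2\pi\sigma^{2})^{d/2}}\,e^{-\|{\bf X}_{i}-{\bf X}_{j}\|^{2}/2\sigma^{2}}, \tag{3}$$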

where $\sigma^{2}$ is the variance of the Gaussian kernel and $\sigma$ acts as its bandwidth. Further, we assume $w_{ii}=0$ , i.e., the graph does not have self-loops. The adjacency matrix of the graph ${\bf W}$ is an $n\times n$ symmetric matrix with elements $w_{ij}$ , while the degree matrix is a diagonal matrix with elements ${\bf D}_{ii}=\sum_{j}w_{ij}$ . We define the graph Laplacian as ${\bf L}=\frac{1}{n}({\bf D}-{\bf W})$ . Normalization by $n$ ensures that the norm of ${\bf L}$ is stochastically bounded as $n$ grows. Since the graph is undirected, ${\bf L}$ is a symmetric matrix with non-negative eigenvalues $0\leq\lambda_{1}\leq\dots\leq\lambda_{n}$ and an orthogonal set of corresponding eigenvectors $\{{\bf u}_{1},\dots,{\bf u}_{n}\}$ . It is known that for a larger eigenvalue $\lambda$ , the corresponding eigenvector ${\bf u}$ exhibits greater variation when plotted over the nodes of the graph [11]. Thus, one of the fundamental postulates of Graph Signal Processing consists of using the eigen-decomposition of ${\bf L}$ to provide a notion of frequency for graph signals, with the eigenvalues acting as graph frequencies and the eigenvectors forming the graph Fourier basis [11].
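The graph construction described above can be sketched as follows. The data (standard normal points in two dimensions), the value of $\sigma$, the boundary $x=0$, and the function name `gaussian_laplacian` are assumptions for illustration. The last lines also check numerically that the graph cut of a partition equals $n\,{\bf 1}_{S}^{T}{\bf L}{\bf 1}_{S}$, an identity listed among the paper's equations.

```python
import numpy as np

def gaussian_laplacian(X, sigma):
    """Return W and L = (D - W) / n for the fully connected Gaussian-kernel graph on the rows of X."""
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)
    np.fill_diagonal(W, 0.0)            # no self-loops: w_ii = 0
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = (D - W) / n                     # Laplacian normalized by n
    return W, L

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))       # assumed toy data
W, L = gaussian_laplacian(X, sigma=0.5)

# Graph frequencies (ascending) and graph Fourier basis (columns of eigvecs).
eigvals, eigvecs = np.linalg.eigh(L)

# Cut of the partition induced by the assumed boundary x = 0.
one_S = (X[:, 0] > 0).astype(float)
cut = W[one_S == 1][:, one_S == 0].sum()
quad = one_S @ L @ one_S
print(cut, X.shape[0] * quad)
```

Dense pairwise distances are used here for brevity; this costs memory quadratic in $n$ and is only meant for small examples.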

II-C Graph sampling theory: bandwidth, bandlimited reconstruction and label complexity

In traditional sampling theory, bandwidth plays an important role in specifying the inherent dimensionality of a signal and therefore determines the sampling rate required for perfect reconstruction. A similar notion exists for signals defined over graphs – the bandwidth $\omega({\bf f})$ of any signal ${\bf f}$ on the graph is defined as the largest eigenvalue for which the projection of the signal on the corresponding eigenvector is non-zero [27, 12, 15], i.e.,

$$\omega({\bf f}):=\max_{i}\big\{\lambda_{i}\;\big|\;|{\bf u}_{i}^{T}{\bf f}|>0\big\}. \tag{4}$$

Signals with lower bandwidth have low frequency content, and tend to be smoother on the graph.
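A small numerical sketch of this definition is given below. In floating-point arithmetic the test $|{\bf u}_{i}^{T}{\bf f}|>0$ has to be replaced by a threshold, so the relative tolerance `tol` is an assumption of this sketch, as are the toy path graph and the helper names `path_laplacian` and `bandwidth`.

```python
import numpy as np

def path_laplacian(n):
    """Toy graph (assumption): unweighted path on n nodes, L = (D - W) / n."""
    W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return (np.diag(W.sum(axis=1)) - W) / n

def bandwidth(lap, f, tol=1e-8):
    """Largest eigenvalue whose eigenvector carries a non-negligible part of f."""
    eigvals, eigvecs = np.linalg.eigh(lap)
    coeffs = eigvecs.T @ f                                # graph Fourier coefficients u_i^T f
    support = np.abs(coeffs) > tol * np.linalg.norm(f)    # numerical stand-in for "> 0"
    return float(eigvals[support].max()) if support.any() else 0.0

lap = path_laplacian(12)
eigvals, eigvecs = np.linalg.eigh(lap)

# Hypothetical signal spanned by the three lowest-frequency eigenvectors.
f = eigvecs[:, :3] @ np.array([1.0, -0.5, 0.25])
print(bandwidth(lap, f), eigvals[2])
```

For this signal the computed bandwidth coincides with the third smallest eigenvalue, since that is the highest frequency with a non-zero coefficient.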

Bandwidth plays a central role in the sampling theoretic approach to semi-supervised learning, where the class indicator signals are assumed to be bandlimited over the similarity graph and interpolated through bandlimited reconstruction. For a ground-truth signal ${\bf f}$ that we are trying to reconstruct, and whose values are known only on a subset $L\subset\{1,2,\dots,n\}$ , this approach involves solving the following least-squares problem [28, 16]:

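$$\min_{\bf g}\;\|{\bf g}_{L}-{\bf f}_{L}\|^{2}\;\;\text{subject to}\;\;\omega({\bf g})\leq\theta, \tag{5}$$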

where ${\bf g}_{L}$ and ${\bf f}_{L}$ denote the values of ${\bf g}$ and ${\bf f}$ , respectively, on the set $L$ . The constraint restricts the hypothesis space to the set of bandlimited signals with bandwidth at most $\theta$ , which is equivalent to enforcing smoothness of the labels over the graph. This method essentially improves upon the Fourier eigenvector approach suggested in [9, 10] in two ways. First, label prediction can be carried out without explicitly computing the eigenvectors of ${\bf L}$ , using efficient iterative approaches implemented via graph filtering operations [28, 29]. Second, one can also use the sampling theorem for graph signals to set $\theta$ as the cutoff frequency $\omega_{c}(L)$ associated with the labeled set [12, 15], which, for a given $L$ , is defined as the bandwidth below which any bandlimited signal is uniquely represented by its values on $L$ . This approach is taken in [27, 16], and is particularly useful when $\omega({\bf f})<\omega_{c}(L)$ , in which case the minimizer ${\bf g}^{*}$ of (5) exactly equals ${\bf f}$ , i.e., $\|{\bf g}^{*}-{\bf f}\|=0$ . Alternatively, one can also reconstruct ${\bf f}$ using the variational problem: $\mathop{\rm min}_{\bf g}\;\omega({\bf g})\;\;\text{subject to}\;\;{\bf g}_{L}={\bf f}_{L}$ ; the minimizer in this case is also exactly equal to ${\bf f}$ if $\omega({\bf f})<\omega_{c}(L)$ [30, 15]. Further, it is also possible to provide error bounds for both methods when $\omega({\bf f})<\omega_{c}(L)$ is not satisfied [15].
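The least-squares problem above can be sketched with an explicit eigendecomposition: restrict ${\bf g}$ to the span of the eigenvectors with eigenvalue at most $\theta$ and fit the coefficients on the labeled nodes. This is not the eigenvector-free iterative approach of [28, 29]; it is a direct version for small graphs. The toy path graph, the labeled set, the value of $\theta$, and the function names are assumptions for illustration.

```python
import numpy as np

def path_laplacian(n):
    """Toy graph (assumption): unweighted path on n nodes, L = (D - W) / n."""
    W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return (np.diag(W.sum(axis=1)) - W) / n

def bandlimited_reconstruct(lap, labeled_idx, f_labeled, theta):
    """Least-squares fit on the labeled nodes over signals with bandwidth at most theta."""
    eigvals, eigvecs = np.linalg.eigh(lap)
    U_R = eigvecs[:, eigvals <= theta]                   # basis of the bandlimited space
    c, *_ = np.linalg.lstsq(U_R[labeled_idx], f_labeled, rcond=None)
    return U_R @ c                                       # extend to all nodes

lap = path_laplacian(20)
eigvals, eigvecs = np.linalg.eigh(lap)

# Hypothetical ground truth spanned by the three lowest-frequency eigenvectors.
f = eigvecs[:, :3] @ np.array([2.0, 1.0, -1.0])
labeled_idx = np.array([0, 5, 10, 15, 19])               # assumed labeled set
theta = 0.5 * (eigvals[2] + eigvals[3])                  # cutoff between 3rd and 4th frequency

g = bandlimited_reconstruct(lap, labeled_idx, f[labeled_idx], theta)
print(np.max(np.abs(g - f)))
```

In this example the five labeled nodes happen to determine every signal in the three-dimensional bandlimited space, so the reconstruction matches the ground truth up to round-off. With a labeled set that does not have this property, the least-squares solution is no longer unique.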

The bandwidth of any indicator signal is also useful in specifying the amount of labeling required for its recovery in the context of sampling theory, as demonstrated by the following key result [15]:

Lemma 1.

Let $\mathcal{N}_{\bf L}(t)$ denote the number of eigenvalues of ${\bf L}$ less than or equal to $t$ . Then, for any signal ${\bf f}$ with bandwidth $\omega({\bf f})$ , there exists a subset of nodes $T\subseteq V$ of size $|T|=\mathcal{N}_{\bf L}(\omega({\bf f}))$ such that ${\bf f}$ can be perfectly recovered from its values ${\bf f}_{T}$ on $T$ .

Proof.

Since ${\bf f}$ has bandwidth $\omega({\bf f})$ , it is spanned by the first $r:=\mathcal{N}_{\bf L}(\omega({\bf f}))$ eigenvectors of ${\bf L}$ , i.e., with $R:=\{1,\dots,r\}$ , we have

$${\bf f}=\sum_{i=1}^{r}c_{i}{\bf u}_{i}={\bf U}_{:,R}\,{\bf c},\qquad(6)$$

where $c_{i}\neq 0$ for $i=\mathcal{N}_{\bf L}(\omega({\bf f}))$ and ${\bf U}_{:,R}$ denotes the rectangular matrix formed using the first $r$ eigenvectors of ${\bf L}$ . Since the eigenvectors $\{{\bf u}_{i}\}$ are orthogonal, ${\bf U}_{:,R}$ has rank $r=\mathcal{N}_{\bf L}(\omega({\bf f}))$ . Therefore, there exists a subset of rows, indexed by a set $T$ , with cardinality $|T|=r=\mathcal{N}_{\bf L}(\omega({\bf f}))$ , such that the $r\times r$ matrix ${\bf U}_{T,R}$ is full-rank, and thus invertible. Using this in (6), we get ${\bf c}={\bf U}_{T,R}^{-1}{\bf f}_{T}$ and thus ${\bf f}$ can be perfectly recovered from ${\bf f}_{T}$ as ${\bf f}={\bf U}_{V,R}{\bf U}_{T,R}^{-1}{\bf f}_{T}$ , thereby proving our claim. Note that this is exactly the closed-form solution of (5), for $L=T$ and $\theta=\omega({\bf f})$ , when the eigenvectors of ${\bf L}$ are known. ∎
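The constructive step in the proof of Lemma 1 can be sketched numerically as follows. The greedy row selection below is only one simple way to find a set $T$ for which ${\bf U}_{T,R}$ is invertible; it is an assumption of this example and not the selection method of [12, 15, 13, 14]. The random graph and the function name `greedy_sampling_set` are likewise hypothetical.

```python
import numpy as np

def greedy_sampling_set(U_R):
    """Greedily pick r rows of the n x r matrix U_R that form an invertible block."""
    n, r = U_R.shape
    T = []
    for i in range(n):
        # Keep row i only if it increases the rank of the selected rows.
        if np.linalg.matrix_rank(U_R[T + [i]]) == len(T) + 1:
            T.append(i)
        if len(T) == r:
            break
    return T

rng = np.random.default_rng(0)
n, r = 10, 4
A = rng.random((n, n))
W = np.triu(A, 1)
W = W + W.T                                   # symmetric weights, zero diagonal
L = (np.diag(W.sum(axis=1)) - W) / n

lam, U = np.linalg.eigh(L)
U_R = U[:, :r]                                # first r eigenvectors
f = U_R @ rng.standard_normal(r)              # signal with N_L(omega(f)) = r

T = greedy_sampling_set(U_R)
f_rec = U_R @ np.linalg.solve(U_R[T], f[T])   # f = U_{V,R} U_{T,R}^{-1} f_T
print(T, np.linalg.norm(f_rec - f))
```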

We shall use this result later, to compute the label complexity of any signal ${\bf f}$ on the graph as $\frac{1}{n}{\cal N}_{\bf L}\left(\omega({\bf f})\right)$ . Note, however, that this quantity only specifies the fraction of nodes to label on the graph – selecting which nodes to label is another question altogether. This problem has been well-studied as part of graph sampling theory [12, 15, 13, 14], with consideration of other important issues such as stability of reconstruction and computational complexity.
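The quantity $\frac{1}{n}{\cal N}_{\bf L}(\omega({\bf f}))$ can be evaluated directly on a small graph. The sketch below assumes that the bandwidth $\omega({\bf f})$ is the largest eigenvalue of ${\bf L}$ whose eigenvector has a nonzero coefficient in ${\bf f}$; the relative tolerance `tol` used to decide what counts as nonzero is an assumption of the example, as is the toy path graph.

```python
import numpy as np

def label_complexity(L, f, tol=1e-9):
    """Return (omega(f), N_L(omega(f)) / n) for a signal f on the graph with Laplacian L."""
    lam, U = np.linalg.eigh(L)
    coeffs = U.T @ f                                   # graph Fourier coefficients
    support = np.abs(coeffs) > tol * np.linalg.norm(f)
    omega = lam[support].max()                         # bandwidth of f
    return omega, np.count_nonzero(lam <= omega) / len(lam)

# Toy graph (assumption): unweighted path on 8 nodes, with L = (D - W) / n.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = (np.diag(W.sum(axis=1)) - W) / n

lam, U = np.linalg.eigh(L)
f = U[:, :3] @ np.ones(3)          # signal supported on the 3 lowest frequencies
omega, frac = label_complexity(L, f)
print(omega, frac)
```

For this signal, three of the eight eigenvalues lie at or below the bandwidth, so the fraction of nodes to label is $3/8$.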

II-D Estimating bandwidth for graph signals

Ideally, computing the bandwidth $\omega({\bf f})$ of a graph signal ${\bf f}$ requires obtaining the eigenvectors $\{{\bf u}_{i}\}$ of ${\bf L}$ and the corresponding projections $\tilde{{\bf f}}_{i}={\bf u}_{i}^{T}{\bf f}$ . However, analyzing the convergence of these coefficients is technically challenging. Therefore, we resort to the following estimate of the bandwidth [15]:

[TABLE]

where we call $\omega_{m}({\bf f})$ the $m^{\text{th}}$ -order bandwidth estimate. It can be shown that the bandwidth estimates satisfy the property: for all $0<m_{1}<m_{2}$ , $\omega_{m_{1}}({\bf f})\leq\omega_{m_{2}}({\bf f})\leq\omega({\bf f})$ . In other words, $\{\omega_{m}({\bf f})\}$ forms a monotonically improving sequence of estimates of the true bandwidth $\omega({\bf f})$ . Further, we can also show [15]:

[TABLE]
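The monotonicity property stated above can be checked numerically. The sketch below assumes that the $m^{\text{th}}$-order estimate has the form $\omega_{m}({\bf f})=\left({\bf f}^{T}{\bf L}^{m}{\bf f}/{\bf f}^{T}{\bf f}\right)^{1/m}$, which is consistent with the quantities analyzed in Section V but is stated here as an assumption. The random graph and signal are hypothetical.

```python
import numpy as np

def bandwidth_estimate(L, f, m):
    """Assumed form of the m-th order estimate: (f^T L^m f / f^T f)^(1/m)."""
    v = f.copy()
    for _ in range(m):          # apply L repeatedly instead of forming L^m
        v = L @ v
    return (f @ v / (f @ f)) ** (1.0 / m)

rng = np.random.default_rng(0)
n = 12
A = rng.random((n, n))
W = np.triu(A, 1)
W = W + W.T
L = (np.diag(W.sum(axis=1)) - W) / n

f = rng.standard_normal(n)      # generic signal: its bandwidth is the largest eigenvalue
orders = [1, 2, 4, 8, 16]
estimates = [bandwidth_estimate(L, f, m) for m in orders]
lam_max = np.linalg.eigvalsh(L).max()
print(estimates, lam_max)
```

Under this assumed form the estimates are power means of the eigenvalues, weighted by the squared Fourier coefficients of ${\bf f}$, which is why they increase with $m$ and never exceed the largest eigenvalue.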

II-E Focus of this paper

The discussion in Section II-C indicates that in the discrete setting, with a finite number of data points, the notions of bandwidth, bandlimited reconstruction and label complexity are well-motivated and quite useful in highlighting a sampling theory perspective of graph-based semi-supervised learning. However, there is a lack of understanding of these concepts in terms of their geometrical interpretation, i.e., their connection with the underlying geometry of the dataset. Thus, inspired by existing analysis in the literature for popular graph-based smoothness measures, we seek to bridge this gap by analyzing these concepts in the asymptotic regime of infinite data points for the data models and graph construction scheme described earlier.

Analyzing the convergence of the bandwidth estimates of class indicator signals for the separable and the nonseparable models constitutes the main subject for the rest of this paper. Our approach, similar to existing results in the literature, starts in the discrete domain by drawing $n$ samples from the data models, constructs a sequence of graphs $G_{n,\sigma}$ from the data points, and considers the behavior of

[TABLE]

over the graphs as $n\rightarrow\infty$ , $\sigma\rightarrow 0$ and $m\rightarrow\infty$ . Intuitively, the condition $n\rightarrow\infty$ implies an abundance of unlabeled data, $\sigma\rightarrow 0$ dictates that the connectivity around each node is meaningful and does not blow up, and $m\rightarrow\infty$ translates to improving estimates of the bandwidth. Our analysis relates $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ to the underlying data distribution $p({\bf x})$ and class boundaries – the hypersurface $\partial S$ in the separable case and the overlap region $\partial A$ in the nonseparable case. Using these results, we also comment on the label complexities of reconstructing ${\bf 1}_{S}$ and ${\bf 1}_{A}$ over the graph in the asymptotic limit.
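To make the setup concrete, the following sketch builds one Gaussian-kernel graph from a finite sample and evaluates bandwidth estimates of two indicator signals: one whose boundary passes through a low-density region and one whose boundary cuts through a cluster. It is a finite-sample illustration only and does not reproduce the asymptotic analysis. The two-cluster data, the value of $\sigma$, the Gaussian form of the kernel and the formula used for $\omega_{m}$ are all assumptions of the example.

```python
import numpy as np

def gaussian_laplacian(X, sigma):
    """L = (D - W) / n with w_ij = exp(-||xi - xj||^2 / (2 sigma^2)) / (2 pi sigma^2)^(d/2)."""
    n, d = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)
    # Diagonal entries of W cancel in D - W, so self-loops need no special handling.
    return (np.diag(W.sum(axis=1)) - W) / n

def bandwidth_estimate(L, f, m):
    """Assumed form of the m-th order estimate: (f^T L^m f / f^T f)^(1/m)."""
    v = f.copy()
    for _ in range(m):
        v = L @ v
    return (f @ v / (f @ f)) ** (1.0 / m)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, 0], 0.5, size=(100, 2)),    # cluster around (-1, 0)
               rng.normal([+1, 0], 0.5, size=(100, 2))])   # cluster around (+1, 0)
L = gaussian_laplacian(X, sigma=0.3)

ind_low = (X[:, 0] > 0).astype(float)    # boundary x1 = 0: between the clusters
ind_high = (X[:, 0] > 1).astype(float)   # boundary x1 = 1: through a cluster center

orders = (1, 2, 4)
omega_low = {m: bandwidth_estimate(L, ind_low, m) for m in orders}
omega_high = {m: bandwidth_estimate(L, ind_high, m) for m in orders}
print(omega_low, omega_high)
```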

III Related work and connections

Existing convergence analyses of the graph-based smoothness measures for various graph construction schemes appear in two different settings – classification and regression. The classification setting assumes that labels indicate class memberships and are discrete, typically with $1/0$ values. Note that both the separable and nonseparable data models considered in our paper are in the classification setting. On the other hand, in the regression setting, one allows the class label signal ${\bf f}$ to be sampled from a smooth function on $\mathbb{R}^{d}$ with soft values, such that ${\bf f}\in\mathbb{R}^{n}$ , and later applies some thresholding mechanism to infer class memberships. For example, in the two class problem, one can assign $+1$ and $-1$ to the two classes and threshold ${\bf f}$ at $0$ . Convergence analysis of smoothness measures in this setting requires different scaling conditions than the classification setting, and leads to fundamentally different limit values that require differentiability of the label functions in the continuum. Applying these to class indicator functions may lead to ill-defined results. A summary of convergence results in the literature for both settings is presented in Table I. Although these results do not focus on analyzing the bandwidth of class indicator signals, the proof techniques used in this paper are inspired by some of these works. We review them in this section and discuss their connections to our work.

III-A Classification setting

Prior work under this setting assumes the separable data model where the feature space is partitioned by smooth decision boundaries into different classes. When $m=1$ , the bandwidth estimate $\omega_{m}({\bf 1}_{S})$ for the separable model in our work reduces (within a scaling factor) to the empirical graph cut for the partitions $S$ and $S^{c}$ of the feature space, i.e.,

[TABLE]

Convergence of this quantity has been studied before in the context of spectral clustering, where one tries to minimize it across the two partitions of the nodes. It has been shown in [24] that the cut formed by a hyperplane $\partial S$ in $\mathbb{R}^{d}$ converges with some scaling under the rate conditions $\sigma\rightarrow 0$ and $n\sigma^{d+1}\rightarrow\infty$ as

[TABLE]

where $d{\bf s}$ ranges over all $(d-1)$ -dimensional volume elements tangent to the hyperplane $\partial S$ , and $p.$ denotes convergence in probability. The analysis has also been extended to other graph construction schemes such as the $k$ -nearest neighbor graph and the $r$ -neighborhood graph, both weighted and unweighted. The condition $\sigma\rightarrow 0$ in (10) is required to have a clear and well-defined limit on the right hand side. We borrow this convergence regime in our work, since it allows a succinct interpretation of the bandwidth of class indicator signals. Intuitively, it enforces sparsity in the similarity matrix ${\bf W}$ by shrinking the neighborhood volume as the number of data points increases. As a result, one can ensure that the graph remains sparse even as the number of points goes to infinity. A similar result for a similarity graph constructed with normalized weights $w^{\prime}_{ij}=w_{ij}/\sqrt{d_{i}d_{j}}$ was shown earlier for an arbitrary hypersurface $\partial S$ in [3], where $d_{i}$ denotes the degree of node $i$ . In this case, normalization of the graph weights results in convergence to $\frac{1}{\sqrt{2\pi}}\int_{\partial S}p({\bf s})d{\bf s}$ . Similarly, in [23], the convergence of normalized cuts is analyzed for points drawn from a uniform density. All of these results aim to provide an interpretation for spectral clustering – up to some scaling, the empirical cut value converges to a weighted volume of the boundary. Thus, spectral clustering is a means of performing low density separation on a finite sample drawn from a distribution in feature space.
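The relation between the first-order quadratic form and the empirical graph cut can be verified directly: with ${\bf L}=\frac{1}{n}({\bf D}-{\bf W})$ , one has ${\bf 1}_{S}^{T}{\bf L}{\bf 1}_{S}=\frac{1}{n}\sum_{i\in S,\,j\in S^{c}}w_{ij}$ . The sketch below checks this identity on a small Gaussian-kernel graph; the data, the partition $S=\{x_{1}>0\}$ and the kernel width are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 30, 0.5
X = rng.standard_normal((n, 2))                                   # d = 2
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2)         # Gaussian kernel weights
L = (np.diag(W.sum(axis=1)) - W) / n

in_S = X[:, 0] > 0                        # partition by the hyperplane x1 = 0
ind = in_S.astype(float)                  # indicator vector 1_S

cut = W[np.ix_(in_S, ~in_S)].sum()        # sum of weights of edges crossing the partition
quad = ind @ L @ ind                      # 1_S^T L 1_S
print(quad, cut / n)
```

The two printed values agree up to rounding, since the within-class weights cancel between ${\bf D}$ and ${\bf W}$ and only the cross-boundary weights remain.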

Note that these works provide little insight for the convergence analysis of higher-order regularizers, i.e., $\omega_{m}({\bf 1}_{S})$ for $m>1$ in our case, since these require different scaling factors and rate conditions. Further, none of these results indicates the continuum limit values of $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ . However, the definition and some of the proof techniques we use for the separable models in this paper have been inspired by [3, 24].

III-B Regression setting

To predict the labels of unknown samples in the regression setting, one generally minimizes the graph Laplacian regularizer ${\bf f}^{T}{\bf L}{\bf f}$ subject to the known label constraints [4]:

[TABLE]
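For reference, minimizing ${\bf f}^{T}{\bf L}{\bf f}$ with the labeled entries held fixed has a closed-form solution obtained by setting the gradient with respect to the unlabeled entries to zero, namely ${\bf f}_{U}=-{\bf L}_{UU}^{-1}{\bf L}_{UL}{\bf f}_{L}$ . The sketch below applies it to a toy two-cluster dataset with one labeled point per cluster. The data, the kernel width, the unnormalized Gaussian weights and the function name `harmonic_labels` are assumptions of the example.

```python
import numpy as np

def harmonic_labels(L, labeled, f_labeled):
    """Minimize f^T L f subject to f[labeled] = f_labeled, using the closed form."""
    n = L.shape[0]
    unlabeled = np.setdiff1d(np.arange(n), labeled)
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    f = np.zeros(n)
    f[labeled] = f_labeled
    f[unlabeled] = np.linalg.solve(L_uu, -L_ul @ f_labeled)
    return f

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-3, 0], 0.4, size=(20, 2)),    # class -1
               rng.normal([+3, 0], 0.4, size=(20, 2))])   # class +1
sigma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / (2 * sigma**2))          # scaling constant omitted; it does not change the minimizer
L = (np.diag(W.sum(axis=1)) - W) / len(X)

labeled = np.array([0, 20])               # one labeled point per cluster
f = harmonic_labels(L, labeled, np.array([-1.0, 1.0]))
pred = np.where(f >= 0, 1, -1)            # threshold at 0
print(pred)
```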

One particular convergence result in this setting assumes that $n$ data points are drawn i.i.d. from $p({\bf x})$ and are labeled by sampling a smooth function $f({\bf x})$ on $\mathbb{R}^{d}$ . Here, the graph Laplacian regularizer ${\bf f}^{T}{\bf L}{\bf f}$ can be shown to converge in the asymptotic limit under the conditions $\sigma\rightarrow 0$ and $n\sigma^{d}\rightarrow\infty$ as in [17, 18]:

[TABLE]

where for each $n$ , ${\bf f}$ is the $n$ -dimensional label vector representing the values of $f({\bf x})$ at the $n$ sample points, $\nabla$ is the gradient operator and $C$ is a constant factor independent of $n$ and $\sigma$ . The right hand side of the result above is a weighted Dirichlet energy functional that penalizes variation in the label function weighted by the data distribution. Similar to the justification of spectral clustering, this result justifies using the formulation in (11) for semi-supervised classification: given label constraints, the predicted label function must vary little in regions of high density. The work of [18, 31] generalizes this result by using arbitrary kernel functions for defining graph weights, and defining data distributions over manifolds in $\mathbb{R}^{d}$ . Convergence results for another regularizer called Graph Total Variation, defined as $GTV({\bf f})=\sum_{i,j}w_{ij}|f_{i}-f_{j}|$ , are presented in [5, 21]. For data points drawn from $p({\bf x})$ defined over a domain $D\subset\mathbb{R}^{d}$ , graph weights given by $w_{ij}=\frac{1}{\varepsilon^{d}}\eta\left(\frac{\|{\bf X}_{i}-{\bf X}_{j}\|}{\varepsilon}\right)$ , one has as $n\rightarrow\infty$ and $\varepsilon\rightarrow 0$ :

[TABLE]

where the limit is analyzed in the setting of $\Gamma$ -convergence [5]. These results extend to the classification setting when $f({\bf x})$ is an indicator function; for example, the limit for $f({\bf x})=1_{S}({\bf x})$ reduces to that of (10). This approach is used in [25] to analyze convergence of Cheeger and ratio cuts.

Similar convergence results have also been derived for the higher-order Laplacian regularizer ${\bf f}^{T}{\bf L}^{m}{\bf f}$ obtained from uniformly distributed data [6]. In this case, it was shown that for data points obtained from a uniform distribution on a $d$ -dimensional submanifold ${\cal M}\subset\mathbb{R}^{N}$ such that ${\rm Vol}({\cal M})=1$ and $2m$ -differentiable functions $f({\bf x})$ , one has as $n\rightarrow\infty$ :

[TABLE]

where $\Delta$ is the Laplace operator and $\sigma_{n}=n^{-1/(2d+4+\alpha)}$ is a vanishing sequence with $\alpha>0$ . Extensions for non-uniform probability distributions $p({\bf x})$ over the manifold can be obtained using the weighted Laplace-Beltrami operator [19, 20]. More recently, an $\ell_{p}$ -based Laplacian regularization has been proposed for imposing smoothness constraints in semi-supervised learning problems [7, 8]. This is similar to a higher-order regularizer but is defined as $J_{p}({\bf f}):=\sum_{i,j\in E}w_{ij}^{p}|f_{i}-f_{j}|^{p}$ , where $w_{ij}=\phi(\|{\bf X}_{i}-{\bf X}_{j}\|/h)$ and $\phi(\cdot)$ is a smoothly decaying kernel function. It has been shown for a bounded density $p({\bf x})$ defined on $[0,1]^{d}$ that for every $p\geq 2$ , as $n\rightarrow\infty$ , followed by $h\rightarrow 0$ ,

[TABLE]

The work of [22] generalizes this result over an open, bounded and connected set $\Omega\subset\mathbb{R}^{d}$ and analyzes rate conditions such that the scalings $n\rightarrow\infty$ , $h\rightarrow 0$ occur jointly.

Note that although our work also uses higher powers of ${\bf L}$ in the expressions for $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ , we cannot use the convergence results in (14) and the proof techniques of (15), since they are only applicable for smooth functions (i.e., differentiable up to a certain order) on $\mathbb{R}^{d}$ . Specifically, in our case, ${\bf 1}_{S}$ in the separable model is sampled from a discontinuous indicator function $1_{S}({\bf x})$ , hence plugging it into existing results does not give a meaningful result for higher values of $m$ . Further, the nonseparable model can only be defined in the classification setting, i.e., ${\bf 1}_{A}$ in the nonseparable model does not have a continuum counterpart. Therefore, our analysis has to take a different route that has more similarities with the proof techniques used for the classification setting. We shall later see that the bulk of the effort in proving our results goes into expanding $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ for any $m$ by keeping track of every term in the expansion. This is followed by a careful evaluation of the integrals in their expected values by reducing them term-by-term.

IV Main results and Discussion

IV-A Interpretation of bandwidth and bandlimited reconstruction

We first show that under certain conditions, the bandwidth estimates of class indicator signals for both the data models, i.e., $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ , over Gaussian kernel-based similarity graphs $G_{n,\sigma}$ constructed from data points in ${\cal X}$ , converge to quantities that are functions of the underlying distribution and the class boundary for both data models. This convergence is achieved under the following asymptotic regime:

1. Increasing size of dataset: $n\rightarrow\infty$ .

2. Shrinking neighborhood volume: $\sigma\rightarrow 0$ .

3. Improving bandwidth estimates: $m\rightarrow\infty$ .

Note that an increasing size of the dataset $n\rightarrow\infty$ is required for the stochastic convergence of the bandwidth estimate. $\sigma\rightarrow 0$ ensures that the limiting values are concise and have a simple interpretation in terms of the data geometry. Intuitively, as the number of data points increases, the neighborhood around each data point shrinks – as a result, the degree of each node in the graph does not blow up. Finally, $m\rightarrow\infty$ leads to improving values of the bandwidth estimate.

The convergence results are precisely stated in the following theorems:

Theorem 1.

If $n\rightarrow\infty$ , $\sigma\rightarrow 0$ and $m\rightarrow\infty$ while satisfying the following rate conditions

1. $(n\sigma^{md+1})/(m^{2}C^{m})\rightarrow\infty$ , where $C=2/(2\pi)^{d/2}$ ,

2. $m2^{m}\sigma\rightarrow 0$ ,

then for the separable model, one has

[TABLE]

where “p.” denotes convergence in probability.

Theorem 2.

If $n\rightarrow\infty$ , $\sigma\rightarrow 0$ and $m\rightarrow\infty$ while satisfying the following rate conditions

1. $(n\sigma^{md})/(m^{2}C^{m})\rightarrow\infty$ , where $C=2/(2\pi)^{d/2}$ ,

2. $m2^{m}\sigma^{2}\rightarrow 0$ ,

then for the non-separable model, one has

[TABLE]

The dependence of the results on the rate conditions will be explained later in the proofs section. An example of parameter choices for scaling laws to hold simultaneously is illustrated in the following remark:

Remark 1.

Equations (16) and (17) hold if for each value of $n$ , we choose $m$ and $\sigma$ as follows:

[TABLE]

for constants $m_{0},\sigma_{0}>0$ , $0<y<1/2$ and $0<x<1$ . $[\;.\;]$ indicates taking the nearest integer value.

Theorems 1 and 2 give an explicit connection between bandwidth estimates of class indicator signals and class boundaries in the dataset. This interpretation forms the basis of justifying the choice of bandwidth as a smoothness constraint in graph-based learning algorithms. Theorem 1 suggests that for the separable model, if the boundary $\partial S$ passes through regions of low probability density, then the bandwidth of the corresponding class indicator vector $\omega({\bf 1}_{S})$ is low. A similar conclusion is suggested for the nonseparable model from Theorem 2, i.e., if the density of data points in the overlap region $\partial A$ is low, then the bandwidth $\omega({\bf 1}_{A})$ is low. In other words, low density of data in the boundary regions leads to smooth indicator functions.

From our results, we also get an intuition behind the smoothness constraint imposed in the bandlimited reconstruction approach (5) for semi-supervised learning. Basically, enforcing smoothness on classes in terms of indicator bandwidth ensures that the algorithm chooses a boundary passing through regions of low data density in the separable case. Similarly, in the nonseparable case, it ensures that variations in labels occur in regions of low density. Further, the bandwidth constraint $\theta$ in (5) effectively imposes a constraint on the complexity of the hypothesis space – a larger value increases the size of the hypothesis space and opens up choices consisting of more complex boundaries.

Note that Theorems 1 and 2 can be improved and their assumptions generalized in several ways:

•

The convergence results can be generalized to graphs with edge weights computed using any non-increasing kernel $\eta_{\sigma}(\|{\bf z}\|)=\frac{1}{\sigma^{d}}\eta(\|{\bf z}\|)$ , where $\sigma$ is a scaling parameter that controls the kernel width and goes to zero as $n\rightarrow\infty$ . The limits of $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ stay the same as in (16) and (17), up to a constant factor.

•

The domain of the data density $p({\bf x})$ can be generalized to open, bounded and connected sets $D\subset\mathbb{R}^{d}$ with Lipschitz boundary similar to the work of [5, 25, 21, 22], or a low dimensional compact manifold embedded in $\mathbb{R}^{d}$ as in [3, 18].

•

Convergence of the bandwidth estimates $\omega_{m}({\bf 1}_{S})$ and $\omega_{m}({\bf 1}_{A})$ does not imply convergence of the actual bandwidths $\omega({\bf 1}_{S})$ and $\omega({\bf 1}_{A})$ , respectively, to the same continuum limiting values. This is because the scaling of $m$ is tied to $n$ and $\sigma$ in our rate conditions, whereas ideally, one should take the limit $m\rightarrow\infty$ first, and independently of $n$ and $\sigma$ while analyzing the estimates. In this case, the scaling factor $\frac{1}{\sigma^{1/m}}$ in the left hand side of (16) also disappears. The analysis for this interchange of limits is challenging and we do not know how to approach this problem at the moment, so we leave it for future work. However, based on experiments in Section VI, where we use actual bandwidths instead of their estimates to validate convergence, we conjecture that the same results hold for both, i.e.,

Conjecture 1.

As $n\rightarrow\infty$ and $\sigma\rightarrow 0$ at appropriate rates, $\omega({\bf 1}_{S})\rightarrow\sup_{{\bf s}\in\partial S}p({\bf s})$ and $\omega({\bf 1}_{A})\rightarrow\sup_{{\bf x}\in\partial A}p({\bf x})$ .

•

Note that Theorems 1 and 2 show pointwise convergence for fixed underlying data models, i.e., convergence is proven for a given indicator signal ${\bf 1}_{S}$ specified by $\{p({\bf x}),\partial S\}$ , and ${\bf 1}_{A}$ specified by $\{p_{A}({\bf x}),p_{A^{c}}({\bf x})\}$ . This is not sufficient when we want to interpret the behavior of a bandwidth-based learning algorithm, since we cannot guarantee that the solution returned by the algorithm matches the solution of its continuum limit version. We need stronger convergence results for this case, such as those recently covered in [5, 25, 22].

Finally, as a special case of our analysis, we also get a convergence result for the graph cut in the nonseparable model analogous to the results of [24] for the separable model. Note that the cut in this case equals the sum of weights of edges connecting points that belong to class $A$ to points that do not belong to class $A$ , i.e.,

[TABLE]

With this definition, we have the following result:

Theorem 3.

If $n\rightarrow\infty$ , $\sigma\rightarrow 0$ such that $n\sigma^{d}\rightarrow\infty$ , then

[TABLE]

The result above indicates that if the overlap between the conditional distributions of a particular class and its complement is low, then the value of the graph cut is lower. This justifies the use of spectral clustering in the context of nonseparable models.

IV-B Label complexity

In the context of our work, we define the label complexity of learning class indicators over the graph using a sampling theoretic approach, as the fraction of labeled nodes required for perfectly predicting the labels of the unlabeled nodes. Formally, for a given class indicator ${\bf 1}_{C}\in\{0,1\}^{n}$ over the graph $G_{n}$ , we define it as the fraction of points that need to be labeled so that a sampling theory-based reconstruction algorithm (such as bandlimited reconstruction of (5)) outputs a solution ${\bf f}^{*}$ with zero reconstruction error: $\|{\bf f}^{*}-{\bf 1}_{C}\|=0$ . Note that perfect reconstruction is a strong requirement that can be relaxed by allowing an error tolerance $\epsilon$ , in which case the amount of labeling required is lower. However, this requirement simplifies our analysis since we can directly use results from sampling theory to evaluate this quantity. Specifically, we can simply use Lemma 1 to calculate the label complexity for ${\bf 1}_{C}$ over the graph as $\frac{1}{n}{\cal N}_{\bf L}(\omega({\bf 1}_{C}))$ . In our context, label complexity is essentially an indicator of how “good” the semi-supervised problem is, i.e., how much help we get from geometry while predicting the unknown labels. A low label complexity is indicative of a favorable situation, where one is able to learn from only a few known labels by exploiting data geometry.

Note that our definition of label complexity is concerned with reconstructing class indicators only on the nodes of the graph. This pertains to the transductive learning philosophy, a common setting considered in most graph-based semi-supervised learning literature, where the goal is to simply predict the labels of the unlabeled points and not learn a general labeling rule/classifier. Further, our definition is different and simpler than the more general $(\epsilon,\delta)$ definition of sample/label complexity in Probably Approximately Correct (PAC) learning [32], i.e., it is concerned with reconstructing only a given class indicator, with zero error, using a sampling theory-based learning approach, over a graph constructed from a given data model.

Ideal label complexities

A simple way to compute the label complexity, for the data models we consider, is to find the fraction of points belonging to a region that fully encompasses the boundary. To formalize this, let us define the following two sublevel sets in $\mathbb{R}^{d}$ :

[TABLE]

Note that by definition, $\partial S$ is fully contained in ${\cal X}_{S}$ and $\partial A$ is fully contained in ${\cal X}_{A}$ (see Figure 2 for an example in $\mathbb{R}^{1}$ ). Therefore, to perfectly reconstruct the indicator signals ${\bf 1}_{S}$ and ${\bf 1}_{A}$ for any $n$ , it is sufficient to know the labels of all points in ${\cal X}_{S}$ and ${\cal X}_{A}$ , respectively, as this strategy removes all ambiguity in labeling the two classes; a good learning algorithm can simply propagate the known labels on to the unlabeled points. Based on this and using the law of large numbers, we arrive at the following conclusion:

Remark 2.

The ideal label complexities of learning ${\bf 1}_{S}$ and ${\bf 1}_{A}$ in the asymptotic limit are given by $P({\cal X}_{S})$ and $P({\cal X}_{A})$ , respectively, where $P(\Omega)=\int_{\Omega}p({\bf x})d{\bf x}$ .

Label complexity of ${\bf 1}_{S}$ and ${\bf 1}_{A}$ using a sampling theory-based approach

Note that from Lemma 1, we know that the label complexities for ${\bf 1}_{S}$ and ${\bf 1}_{A}$ are given as $\frac{1}{n}{\cal N}_{\bf L}(\omega({\bf 1}_{S}))$ and $\frac{1}{n}{\cal N}_{\bf L}(\omega({\bf 1}_{A}))$ , respectively. Since our bandwidth convergence results relate the bandwidth of indicators for the two data models with data geometry, we only need to asymptotically relate the fraction of eigenvalues of ${\bf L}$ below any constant. This is achieved by first proving the following:

Theorem 4.

Let ${\cal N}_{\bf L}(t)$ be the number of eigenvalues of ${\bf L}$ below a constant $t$ . Then, as $n\rightarrow\infty$ and $\sigma\rightarrow 0$ , we have

[TABLE]

Proof.

See Section V-F. ∎

Note that Theorem 4 can be strengthened by proving convergence of $\frac{1}{n}{\cal N}_{\bf L}(t)$ rather than its expected value. This requires further analysis, which we leave for future work. Plugging in $\omega({\bf 1}_{S})$ and $\omega({\bf 1}_{A})$ in place of $t$ in Theorem 4, and using the convergence results from Theorems 1 and 2, and Conjecture 1, we speculate the following convergence for the label complexities of ${\bf 1}_{S}$ and ${\bf 1}_{A}$ :

Conjecture 2.

As $n\rightarrow\infty$ , $\sigma\rightarrow 0$ , we have

[TABLE]

The limiting values in (25) and (26) are the same as those predicted by Remark 2; this is encouraging as far as the validity of Conjecture 2 is concerned. Additionally, we see strong evidence in our experiments to support our claims; specifically, the average error of predicting the labels of the unlabeled nodes goes to zero as the fraction of labeled examples crosses the limit values of (25) and (26) (see Figure 7).

The limiting values in (25) and (26) essentially indicate how the low density separation assumption can benefit semi-supervised learning, since in this case, one can forgo the task of labeling a significant fraction of the points and still reconstruct the indicator by exploiting data geometry. A classic example of where this can be useful is the two-step learning process, where the first step uses semi-supervised learning in a transductive setting to create a large training set using a combination of unlabeled and labeled data, and the second step involves learning a classifier using supervised learning. If the low density separation assumption is satisfied by the data, then semi-supervised learning using a sampling theory-based approach effectively reduces the sample complexity of the supervised learning step by a constant fraction, equal to the limiting values in (25) and (26).

V Proofs

We now present the proofs of Theorems 1 and 2. A partial sketch of the proof for the separable model is also provided in our parallel work [26]; here we provide the complete proof. The main idea is to perform a variance-bias decomposition of the bandwidth estimate and then prove the convergence of each term independently. Specifically, for any indicator vector ${\bf 1}_{R}\in\{0,1\}^{n}$ , we consider the random variable:

[TABLE]

We study the convergence of this quantity by considering the numerator and denominator separately (it is easy to show that the fraction converges if both the numerator and denominator converge). By the strong law of large numbers, the following can be concluded for the denominator as $n\rightarrow\infty$ :

[TABLE]

where $a.s.$ denotes almost sure convergence. For the numerator, we decompose it into two parts – a variance term for which we show stochastic convergence using a concentration inequality, and a bias term for which we prove deterministic convergence.

V-A Expansion of $\frac{1}{n}{\bf 1}_{R}^{T}{\bf L}^{m}{\bf 1}_{R}$

Let $V:=\frac{1}{n}{\bf 1}_{R}^{T}{\bf L}^{m}{\bf 1}_{R}$ . We begin by expanding $V$ as

[TABLE]

where ${\bf B}_{k}$ denotes the $k^{\text{th}}$ term out of the $2^{m}$ terms in the expansion of $({\bf D}-{\bf W})^{m}$ . ${\bf B}_{k}$ is composed of a product of $m$ matrices, each of which can be either ${\bf D}$ or $-{\bf W}$ . In order to write it down explicitly, one can use the $m$ -bit binary representation of the index $k$ and replace $0$ s with ${\bf D}$ and $1$ s with $-{\bf W}$ , i.e., if $b_{v}(k)$ denotes the $v^{\text{th}}$ most-significant bit in the $m$ -bit binary representation of $k$ for $v\in\{1,\dots,m\}$ and $s(k)$ denotes the number of ones in it (i.e., $s(k):=\sum_{v=1}^{m}b_{v}(k)$ ), then

[TABLE]

where the product notation assumes that the ordering of the matrices is kept fixed, i.e., $\prod_{p=1}^{m}{\bf A}_{p}={\bf A}_{1}{\bf A}_{2}\dots{\bf A}_{m}$ .
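The binary indexing of the $2^{m}$ terms can be checked on a small example. The sketch below enumerates the products, mapping bit $0$ to ${\bf D}$ and bit $1$ to $-{\bf W}$ with the most-significant bit first, and verifies that they sum to $({\bf D}-{\bf W})^{m}$ . Indexing $k$ from $0$ to $2^{m}-1$ , the random weights and the function name `expansion_term` are assumptions of the example.

```python
import numpy as np

def expansion_term(D, W, k, m):
    """Return (B_k, s(k)): ordered product of m factors, bit 0 -> D and bit 1 -> -W."""
    bits = [(k >> (m - v)) & 1 for v in range(1, m + 1)]   # b_1(k), ..., b_m(k)
    B = np.eye(D.shape[0])
    for b in bits:                                         # keep the left-to-right order
        B = B @ (-W if b else D)
    return B, sum(bits)

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.random((n, n))
W = np.triu(A, 1)
W = W + W.T
D = np.diag(W.sum(axis=1))

terms = [expansion_term(D, W, k, m) for k in range(2 ** m)]
total = sum(B for B, _ in terms)
direct = np.linalg.matrix_power(D - W, m)
print(np.abs(total - direct).max())
```

Each term has the sign $(-1)^{s(k)}$ entrywise because ${\bf D}$ and ${\bf W}$ have non-negative entries, which matches the sign rule used in the expansion that follows.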

Noting that ${\bf D}$ and ${\bf W}$ are composed of the edge weights $w_{ij}=\frac{1}{(2\pi\sigma^{2})^{d/2}}K({\bf X}_{i},{\bf X}_{j})$ , we now describe how to expand the quadratic form $V$ by considering each term ${\bf 1}_{R}^{T}{\bf B}_{k}{\bf 1}_{R}$ individually:

1. The sign of the term ${\bf 1}_{R}^{T}{\bf B}_{k}{\bf 1}_{R}$ is determined by the number of $(-{\bf W})$ matrices in the product ${\bf B}_{k}$ .

2. By using the definitions of ${\bf D}$ and ${\bf W}$ in the product expansion of ${\bf B}_{k}$ , the absolute value of ${\bf 1}_{R}^{T}{\bf B}_{k}{\bf 1}_{R}$ can be expressed through the following template:

[TABLE]

where $({\bf 1}_{R})_{i}$ denotes the $i^{\text{th}}$ element of the indicator vector, and the locations with a “ $*$ ” need to be filled with appropriate indices in $\{i_{1},\dots,i_{m+1}\}$ . Note that the template consists of a product of $m$ edge weights $w_{ij}$ , each contributed by either a ${\bf D}$ or ${\bf W}$ depending on its location in the expression.

3. By performing an explicit matrix multiplication, we fill the locations from left to right one-by-one using the following rule: let a term containing a $*$ be preceded by an edge-weight $w_{ab}$ , then,

•

If $w_{ab}$ is contributed by ${\bf D}$ , then $*=a$ .

•

If $w_{ab}$ is contributed by ${\bf W}$ , then $*=b$ .

Since the binary representation of $k$ is closely tied to the ordering of ${\bf D}$ and ${\bf W}$ in the product term ${\bf B}_{k}$ , we can once again use it to explicitly express ${\bf 1}_{R}^{T}{\bf B}_{k}{\bf 1}_{R}$ . In order to populate any “*” location according to the rules above, we require a quantity that depends on the position of the last occurring ${\bf W}$ with respect to any location in the product expression of ${\bf B}_{k}$ . Therefore, using the $m$ -bit binary representation of $k$ , we define for location $u\in\{1,\dots,m\}$ :

[TABLE]

where $\max(\{.\})$ returns the maximum element in a set of numbers. The template described in (31) can then be completed using the rules to obtain

[TABLE]

Finally, the expansion of $V$ can be obtained by summing the $2^{m}$ quadratic forms in (29):

[TABLE]

where we defined

[TABLE]

V-B Convergence of variance terms

For $V=\frac{1}{n}{\bf 1}_{R}^{T}{\bf L}^{m}{\bf 1}_{R}$ , we have the following concentration result:

Lemma 2 (Concentration).

For every $\epsilon>0$ , we have:

[TABLE]

where $C=2/(2\pi)^{d/2}$ .

Proof.

Note that the expansion of $V$ in (34) has the form of a V-statistic. Further, as defined in (35), $g$ is composed of a sum of $2^{m}$ terms, each a product of $m$ kernel functions $K$ that are non-negative. Therefore, we have the following upper bound:

[TABLE]

In order to apply a concentration inequality for $V$ , we first re-write it in the form of a U-statistic by regrouping terms in the summation in order to remove repeated indices, as given in [33]:

[TABLE]

where $\sum_{(n,m+1)}$ denotes summation over all ordered (m+1)-tuples $(i_{1},\dots,i_{m+1})$ of distinct indices taken from the set $\{1,\dots,n\}$ , $n^{(m+1)}=n.(n-1)\dots(n-m)$ is the falling factorial (or number of (m+1)-permutations of $n$ ) and $g^{*}$ is a weighted arithmetic mean of specific instances of $g$ that avoids repeating indices:

[TABLE]

where $\sum_{(j)}^{*}$ denotes summation over all $(m+1)$ -tuples $(l_{1},l_{2},\dots,l_{m+1})$ formed from $\{1,\dots,j\}$ with exactly $j$ distinct indices. Note that the number of such $(m+1)$ -tuples is given by $\genfrac{\{}{\}}{0.0pt}{}{m+1}{j}$ , which is a Stirling number of the second kind. Hence, we have

[TABLE]

where we used the property $\sum_{j=0}^{m+1}n^{(j)}\genfrac{\{}{\}}{0.0pt}{}{m+1}{j}=n^{m+1}$ . Therefore, $g^{*}$ has the same upper bound as that of $g$ derived in (37). Moreover, using the fact that $\mathbb{E}\left\{{V}\right\}=\mathbb{E}\left\{{g^{*}({\bf X}_{i_{1}},{\bf X}_{i_{2}},\dots,{\bf X}_{i_{m+1}})}\right\}$ , we can bound the variance of $g^{*}$ as

[TABLE]

Finally, plugging in the bound and variance of $g^{*}$ in Bernstein’s inequality for U-statistics as stated in [33, 31], we arrive at the desired result of (36). ∎
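The Stirling-number identity used in the proof above, $\sum_{j}n^{(j)}\genfrac{\{}{\}}{0.0pt}{}{m+1}{j}=n^{m+1}$ , can be verified for small values. The sketch below is illustrative only; the function names `stirling2_row` and `falling_factorial` are hypothetical helpers, not part of the paper.

```python
def stirling2_row(r):
    """Return [S(r, 0), ..., S(r, r)], Stirling numbers of the second kind,
    via the recurrence S(i, j) = j * S(i-1, j) + S(i-1, j-1)."""
    row = [1]  # S(0, 0) = 1
    for i in range(1, r + 1):
        new = [0] * (i + 1)
        for j in range(1, i + 1):
            new[j] = (j * row[j] if j < i else 0) + row[j - 1]
        row = new
    return row

def falling_factorial(n, j):
    """n^(j) = n (n-1) ... (n-j+1), with n^(0) = 1."""
    out = 1
    for i in range(j):
        out *= n - i
    return out

def identity_lhs(n, m):
    """Left-hand side: sum over j of n^(j) * S(m+1, j)."""
    row = stirling2_row(m + 1)
    return sum(falling_factorial(n, j) * row[j] for j in range(m + 2))
```

For example, `identity_lhs(n, m)` can be compared against `n ** (m + 1)` for any small integers `n` and `m`.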

Note that as $n\rightarrow\infty$ and $\sigma\rightarrow 0$ with rates satisfying $(n\sigma^{md})/(mC^{m})\rightarrow\infty$ , we have $P(|V-\mathbb{E}\left\{{V}\right\}|>\epsilon)\rightarrow 0$ for all $\epsilon>0$ . The continuous mapping theorem then allows us to conclude that $V^{1/m}\xrightarrow[]{p.}(\mathbb{E}\left\{{V}\right\})^{1/m}$ .

V-C Expansion of $\mathbb{E}\left\{{\frac{1}{n}{\bf 1}_{R}^{T}{\bf L}^{m}{\bf 1}_{R}}\right\}$

The V-statistic expansion of $V=\frac{1}{n}{\bf 1}_{R}^{T}{\bf L}^{m}{\bf 1}_{R}$ in (34) has summands with repeating indices, hence we first define a U-statistic counterpart that avoids these repetitions:

[TABLE]

where $g({\bf X}_{i_{1}},{\bf X}_{i_{2}},\dots,{\bf X}_{i_{m+1}})$ are the kernels defined in (35), and the definitions of $\sum_{(n,m+1)}$ and $n^{(m+1)}$ are the same as those for (38). The $U$ -statistic definition is convenient since

[TABLE]

as opposed to $\mathbb{E}\left\{{V}\right\}$ , where one would have to deal with terms with repeated indices separately. Further, note that

[TABLE]

where $\sum_{(n,m+1)^{*}}$ denotes summation over all ordered $(m+1)$ -tuples $(i_{1},\dots,i_{m+1})$ of indices obtained from $\{1,2,\dots,n\}$ such that at least two of them are equal. Note that there are $n^{m+1}-n^{(m+1)}$ terms in the summation $\sum_{(n,m+1)^{*}}$ . Therefore, we have

[TABLE]

where we used $n^{m+1}-n^{(m+1)}=O(m^{2}n^{m})$ , $\mathbb{E}\left\{{g}\right\}\leq\|g\|_{\infty}$ and $\|g\|_{\infty}=\frac{C^{m}}{\sigma^{md}}$ from (37).
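For completeness, one way to see the counting bound $n^{m+1}-n^{(m+1)}=O(m^{2}n^{m})$ is sketched below. This short derivation is not taken from the original text; it assumes $n>m$ and uses the elementary inequality $\prod_{i}(1-x_{i})\geq 1-\sum_{i}x_{i}$ for $x_{i}\in[0,1]$ .

```latex
n^{(m+1)} = n^{m+1}\prod_{i=0}^{m}\Bigl(1-\frac{i}{n}\Bigr)
\;\geq\; n^{m+1}\Bigl(1-\sum_{i=0}^{m}\frac{i}{n}\Bigr)
= n^{m+1}-\frac{m(m+1)}{2}\,n^{m},
\qquad\text{so}\qquad
0\;\leq\; n^{m+1}-n^{(m+1)} \;\leq\; \frac{m(m+1)}{2}\,n^{m} = O(m^{2}n^{m}).
```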

We now focus on computing $\mathbb{E}\left\{{g({\bf X}_{i_{1}},{\bf X}_{i_{2}},\dots,{\bf X}_{i_{m+1}})}\right\}$ . Based on (35), we can express it as follows:

[TABLE]

where we define:

[TABLE]

with $c_{u}(k)$ defined as in (32).

V-D Convergence of bias term for the separable model

To evaluate the convergence of bias terms, we shall require the following properties of the $d$ -dimensional Gaussian kernel:

Lemma 3.

If $p({\bf x})$ is twice differentiable, then

[TABLE]

Proof.

Using the substitution ${\bf y}={\bf x}+{\bf t}$ followed by a Taylor series expansion about ${\bf x}$ , we have

[TABLE]

where ${\rm Tr}(.)$ denotes the trace of a matrix, and the third step follows from simple component-wise integration. ∎
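The second-order Taylor argument used in this proof can be illustrated numerically in one dimension. The sketch below assumes a normalized Gaussian kernel with variance $\sigma^{2}$ and uses the standard normal density as a test density $p$ ; both choices, and all function names, are assumptions made for illustration. It compares a quadrature approximation of $\int K(x,y)p(y)dy$ with the standard second-order approximation $p(x)+\frac{\sigma^{2}}{2}p^{\prime\prime}(x)$ .

```python
import math

def gauss_kernel_1d(x, y, sigma):
    """Normalized 1-D Gaussian kernel with variance sigma^2 (assumed form)."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def p(x):
    """Test density: standard normal (an illustrative choice)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def p_second(x):
    """Second derivative of the standard normal density."""
    return (x * x - 1) * p(x)

def smoothed(x, sigma, half_width=8.0, steps=4000):
    """Midpoint-rule approximation of the integral of K(x, y) p(y) over
    y in [x - half_width * sigma, x + half_width * sigma]."""
    a = x - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        y = a + (i + 0.5) * h
        total += gauss_kernel_1d(x, y, sigma) * p(y)
    return h * total

x0, sigma0 = 0.5, 0.1
lhs = smoothed(x0, sigma0)
second_order = p(x0) + 0.5 * sigma0 ** 2 * p_second(x0)
```

For this test density the smoothed value is also available in closed form, since convolving a standard normal with a Gaussian of variance $\sigma^{2}$ gives a normal density with variance $1+\sigma^{2}$ .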

Lemma 4.

If $p({\bf x})$ is twice differentiable, then

[TABLE]

Proof.

Note that

[TABLE]

Therefore, we have

[TABLE]

where the last step follows from Lemma 3. ∎

In order to prove convergence for the separable model, we need the following results:

Lemma 5.

If $p({\bf x})$ is Lipschitz continuous, then for a smooth hypersurface $\partial S$ that divides $\mathbb{R}^{d}$ into $S_{1}$ and $S_{2}$ , and whose curvature has radius lower-bounded by $\tau>0$ ,

[TABLE]

where $\alpha$ and $\beta$ are positive integers. Moreover, for positive integers $a,b$ , and $\alpha,\beta,\alpha^{\prime},\beta^{\prime}$ such that $\alpha+\beta=\alpha^{\prime}+\beta^{\prime}=\gamma$ , we have:

[TABLE]

Proof.

See Appendix A. ∎

We now prove the deterministic convergence of $\mathbb{E}\left\{{\frac{1}{n}{\bf 1}_{S}^{T}{\bf L}^{m}{\bf 1}_{S}}\right\}$ in the following lemma:

Lemma 6.

As $n\rightarrow\infty$ , $\sigma\rightarrow 0$ such that $m2^{m}\sigma\rightarrow 0$ and $\frac{m^{2}C^{m}}{n\sigma^{md+1}}\rightarrow 0$ , we have

[TABLE]

where $t(m)=\sum_{r=0}^{m-1}\binom{m-1}{r}(-1)^{r}(\sqrt{r+1}-\sqrt{r})$ .

Proof.

Using (45) and (46), and replacing ${\bf 1}_{R}$ with ${\bf 1}_{S}$ , we have

[TABLE]

We pair all even-indexed and odd-indexed terms together to rewrite the summation as:

[TABLE]

Now, $h_{0}$ and $h_{1}$ can be evaluated by repeatedly applying (48) for every Gaussian kernel in the definition from (47). Hence, for the first summation pair, we obtain:

[TABLE]

For the rest of the terms, we also require the use of (49). However, in this case, we encounter several terms of the form $p(\theta{\bf x}+(1-\theta){\bf y})$ for some $\theta\in[0,1]$ . Since $m\sigma^{2}\rightarrow 0$ and $p({\bf x})$ is assumed to be Lipschitz continuous, we can approximate such terms by $p({\bf x})$ or $p({\bf y})$ . Further, the number of times we have to apply (49) in any $h_{k}$ is equal to the number of occurrences of ${\bf W}$ in ${\bf B}_{k}$ (which is $s(k)$ ). Therefore, for $1\leq l\leq 2^{m-1}-1$ , we have

[TABLE]

where $\alpha,\beta,\alpha^{\prime},\beta^{\prime}$ are positive integers such that $\alpha+\beta=\alpha^{\prime}+\beta^{\prime}=m+1$ . Plugging (55) and (56) into (53), we get:

[TABLE]

where we grouped terms based on $r=s(2l)$ in the summation (note that there are $\binom{m-1}{r}$ such terms for a given $r$ ).

Using Lemma 5, we conclude that the right hand side of (57) converges as $n\rightarrow\infty$ and $\sigma\rightarrow 0$ to

[TABLE]

which is the desired result. ∎
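The constant $t(m)$ appearing in Lemma 6 is an alternating binomial sum and is easy to tabulate for small $m$ . The following sketch evaluates the formula as stated in the lemma; the function name `t` mirrors the paper's notation, and the remark on precision is ours.

```python
from math import comb, sqrt

def t(m):
    """t(m) = sum_{r=0}^{m-1} C(m-1, r) (-1)^r (sqrt(r+1) - sqrt(r)).

    The sum alternates in sign, so floating-point cancellation grows with m;
    this direct evaluation is only intended for small m."""
    return sum(
        comb(m - 1, r) * (-1) ** r * (sqrt(r + 1) - sqrt(r))
        for r in range(m)
    )

values = [t(m) for m in range(1, 6)]
```

In particular, $t(1)=1$ and $t(2)=2-\sqrt{2}$ follow directly from the definition.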

Using the continuous mapping theorem on (52), we can conclude

[TABLE]

Finally, we note that as $m\rightarrow\infty$ , we have

[TABLE]

Therefore, we conclude for the separable model

[TABLE]

V-E Convergence of bias term for the nonseparable model

For the nonseparable model, we need to prove convergence of $\mathbb{E}\left\{{\frac{1}{n}{\bf 1}_{A}^{T}{\bf L}^{m}{\bf 1}_{A}}\right\}$ . This is illustrated in the following lemma:

Lemma 7.

As $n\rightarrow\infty$ , $\sigma\rightarrow 0$ such that $m2^{m}\sigma^{2}\rightarrow 0$ and $\frac{m^{2}C^{m}}{n\sigma^{md}}\rightarrow 0$ , we have

[TABLE]

Proof.

Similar to the proof of Lemma 6, we use (45) and (46), and replace ${\bf 1}_{R}$ with ${\bf 1}_{A}$ to obtain

[TABLE]

Using (48) repeatedly in the definition (47), we get

[TABLE]

where we used the fact that $p({\bf x})=\alpha_{A}p_{A}({\bf x})+\alpha_{A^{c}}p_{A^{c}}({\bf x})$ . Similarly, for $1\leq l\leq 2^{m-1}-1$ , we have

[TABLE]

Putting together (63) and (64) into (62), we get

[TABLE]

Taking limits while satisfying the stated rate conditions, we get the desired result. ∎

We finally note that as $m\rightarrow\infty$ , we have

[TABLE]

Therefore, we conclude for the nonseparable model

[TABLE]

Note that Lemma 7 for the special case of $m=1$ yields

[TABLE]

which proves Theorem 3.

V-F Proof of Theorem 4

We begin by recalling the definition of the empirical spectral distribution (ESD) of ${\bf L}$ :

[TABLE]

where $\{\lambda_{i}\}$ are the eigenvalues of ${\bf L}$ . For each $x$ , $\mu_{n}(x)$ is a function of ${\bf X}_{1},\dots,{\bf X}_{n}$ , and thus a random variable. Note that the fraction of eigenvalues of ${\bf L}$ below a constant $t$ , and its expected value can be computed from the ESD as

[TABLE]

Therefore, to understand the behavior of the expected fraction of eigenvalues of ${\bf L}$ below $t$ , we need to analyze the convergence of the expected ESD in the asymptotic limit. The idea is to show the convergence of the moments of $\mathbb{E}\left\{{\mu_{n}(x)}\right\}$ to the moments of a limiting distribution $\mu(x)$ . Then, by a standard convergence result, $\mathbb{E}\left\{{\mu_{n}(I)}\right\}\rightarrow\mu(I)$ for intervals $I$ . More precisely, let the $\Rightarrow$ symbol denote weak convergence of measures, then we use the following result that follows from the Weierstrass approximation theorem:

Lemma 8.

Let $\mu_{n}$ be a sequence of probability measures and $\mu$ be a compactly supported probability measure. If $\int x^{m}\mu_{n}(dx)\rightarrow\int x^{m}\mu(dx)$ for all $m\geq 1$ , then $\mu_{n}\Rightarrow\mu$ .

We then use the following result on equivalence of different notions of weak convergence of measures [34, Theorem 25.2] in order to prove our result for cumulative distribution functions.

Lemma 9.

$\mu_{n}\Rightarrow\mu$ if and only if $\mu_{n}(A)\rightarrow\mu(A)$ for every $\mu$ -continuity set $A$ .

Therefore, we simply need to analyze the convergence of moments of $\mathbb{E}\left\{{\mu_{n}(x)}\right\}$ . Note that the $m^{\text{th}}$ moment of $\mathbb{E}\left\{{\mu_{n}(x)}\right\}$ can be written as:

[TABLE]
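For a single realization of the graph, the $m^{\text{th}}$ moment of the ESD equals $\frac{1}{n}\text{Tr}({\bf L}^{m})$ , which is what makes the trace-based analysis possible. A minimal numerical sketch is given below; the function names are hypothetical, and "below $t$" is implemented as $\lambda_{i}\leq t$ , which is an assumption about the convention.

```python
import numpy as np

def esd_moment(L, m):
    """m-th moment of the empirical spectral distribution of symmetric L."""
    lam = np.linalg.eigvalsh(L)
    return np.mean(lam ** m)

def esd_fraction_below(L, t):
    """Fraction of eigenvalues of L that are <= t."""
    lam = np.linalg.eigvalsh(L)
    return float(np.mean(lam <= t))

# Illustrative random graph (not data from the paper), with L = (D - W) / n.
rng = np.random.default_rng(1)
n = 10
A = rng.uniform(size=(n, n))
W = (A + A.T) / 2
L = (np.diag(W.sum(axis=1)) - W) / n

moment_from_eigs = esd_moment(L, 3)
moment_from_trace = np.trace(np.linalg.matrix_power(L, 3)) / n
```

The two quantities `moment_from_eigs` and `moment_from_trace` coincide up to floating-point error.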

We reuse our analysis in Section V-A, specifically the expansion in (29) to obtain

[TABLE]

Using the binary representation of $k$ once again similar to (33), we can compute:

[TABLE]

Note that $\text{Tr}\left({{\bf B}_{k}}\right)$ has a summation over $m$ indices for $k>1$ ; as a result, a factor of $\frac{1}{n}$ remains in the expectation. Similarly, terms with repeated indices disappear and thus, we have the following for the right hand side of (72) as $n\rightarrow\infty$ :

[TABLE]

Using (48) repeatedly in the equation above, we get:

[TABLE]

Therefore, as $n\rightarrow\infty$ and $\sigma\rightarrow 0$ , we have:

[TABLE]

From the right hand side of the equation above, we conclude that the $m^{\text{th}}$ moment of the expected ESD of ${\bf L}$ converges to the $m^{\text{th}}$ moment of the distribution of a random variable $Y=p({\bf X})$ , where $p({\bf x})$ is the probability density function of ${\bf X}$ . Moreover, since $p_{Y}(y)$ has compact support, $\mathbb{E}\left\{{\mu_{n}(x)}\right\}$ converges weakly to the distribution with probability density function $p_{Y}(y)$ . Hence, the following can be said about the expected fraction of eigenvalues of ${\bf L}$ :

[TABLE]

This proves our claim in Theorem 4. Note that, to prove the stochastic convergence of the fraction itself rather than its expected value, we would need a condition similar to those in Theorems 1 and 2 to hold for each moment. In that case, $\sigma$ will go to 0 in a prohibitively slow fashion. We believe that this is an artifact of the methods we employ for proving the result. Hence, our conjecture is that the convergence result holds for $\frac{1}{n}\mathcal{N}_{\bf L}(t)$ itself, and we leave the analysis of this statement for future work.

VI Numerical validation

We now present simple numerical experiments (code: https://github.com/aamiranis/asymptotics_graph_ssl) to validate our results and demonstrate their usefulness in practice. A key focus in our experiments is to confirm Conjecture 1, i.e., the convergence results for the bandwidth estimates also hold for the actual bandwidths. In order to achieve this, we work directly with the bandwidths of the indicators instead of their estimates and numerically validate their convergence for both the separable and nonseparable models.

For simulating the separable model, we first consider a data distribution based on a 2D Gaussian Mixture Model (GMM) with two Gaussians: $\mu_{1}=[-1,\;0],\Sigma_{1}=0.25{\bf I}$ and $\mu_{2}=[1,\;0],\Sigma_{2}=0.16{\bf I}$ , and mixing proportions $\alpha_{1}=0.4$ and $\alpha_{2}=0.6$ respectively. The probability density function is illustrated in Figure 3. Next, we evaluate the claim of Theorem 1 on five boundaries, described in Table II. These boundaries are depicted in Figure 4 and are illustrative of typical separation assumptions such as linear or non-linear and low or high density.

For simulating the nonseparable model, we first construct the following smooth (twice-differentiable) 2D probability density function

[TABLE]

Note that data points $(X,Y)$ can be sampled from this distribution by setting the coordinates $X=\sqrt{1-U^{1/4}}\cos(2\pi V)$ , $Y=\sqrt{1-U^{1/4}}\sin(2\pi V)$ , where $U,V\sim\text{Uniform}(0,1)$ . We then use $q(x,y)$ to define a nonseparable 2D model with mixture density $p(x,y)=\alpha_{A}p_{A}(x,y)+\alpha_{A^{c}}p_{A^{c}}(x,y)$ , where $p_{A}(x,y)=q(x-0.75,y)$ , $p_{A^{c}}(x,y)=q(x+0.75,y)$ and $\alpha_{A}=\alpha_{A^{c}}=0.5$ . The probability density function is illustrated in Figure 3. The overlap region or boundary $\partial A$ for this model is given by

[TABLE]

Further, for this model, we have $\sup_{\partial A}p({\bf x})=0.2517$ .
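The sampling recipe for the nonseparable model can be sketched as follows. This is an illustrative implementation of the coordinate transformation stated above, not the authors' released code; the function names `sample_q` and `sample_nonseparable` and the seed are our own choices.

```python
import numpy as np

def sample_q(n, rng):
    """Draw n points using X = sqrt(1 - U^{1/4}) cos(2 pi V),
    Y = sqrt(1 - U^{1/4}) sin(2 pi V), with U, V ~ Uniform(0, 1)."""
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    r = np.sqrt(1.0 - u ** 0.25)
    return np.column_stack((r * np.cos(2 * np.pi * v), r * np.sin(2 * np.pi * v)))

def sample_nonseparable(n, rng, shift=0.75, alpha_A=0.5):
    """Mixture sample: class A is q shifted to x = +shift, class A^c to x = -shift.
    Returns the points and a boolean array marking membership in class A."""
    in_A = rng.uniform(size=n) < alpha_A
    pts = sample_q(n, rng)
    pts[:, 0] += np.where(in_A, shift, -shift)
    return pts, in_A

rng = np.random.default_rng(0)
points, labels_A = sample_nonseparable(2500, rng)
```

Every sampled point lies within unit distance of the center of its own class, so the two classes overlap only in the lens-shaped region between the two unit disks.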

In our first experiment, we validate the statements of Theorems 1 and 2 by comparing the left and right hand sides of (16) and (17) for corresponding boundaries. This is carried out in the following way: we draw $n=2500$ points from each model and construct the corresponding similarity graphs using $\sigma=0.1$ . Then, for the boundaries $\partial S_{i}$ in the separable model and $\partial A$ in the nonseparable model, we carry out the following steps:

1. We first construct the indicator functions ${\bf 1}_{S_{i}}$ and ${\bf 1}_{A}$ on the corresponding graphs.

2. We then compute the empirical bandwidth $\omega({\bf 1}_{S_{i}})$ and $\omega({\bf 1}_{A})$ in a manner that takes care of numerical error: we first obtain the eigenvectors of the corresponding ${\bf L}$ , then set $\omega({\bf 1}_{S_{i}})$ and $\omega({\bf 1}_{A})$ to be $\nu$ for which the energy contained in the graph Fourier coefficients corresponding to eigenvalues $\lambda_{j}>\nu$ is at most $0.01\%$ , i.e.,

[TABLE]
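The two steps above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `empirical_bandwidth` is hypothetical, and we assume that the bandwidth is taken as the smallest eigenvalue $\nu$ of ${\bf L}$ meeting the energy criterion, with the $0.01\%$ threshold expressed as `tol=1e-4`.

```python
import numpy as np

def empirical_bandwidth(L, f, tol=1e-4):
    """Smallest eigenvalue nu of L such that the graph Fourier coefficients of f
    at higher frequencies carry at most a fraction tol of the total energy."""
    lam, U = np.linalg.eigh(L)          # eigenvalues in ascending order
    energy = (U.T @ f) ** 2             # energy per graph Fourier coefficient
    total = energy.sum()
    # tail[j] = energy in coefficients with index > j. With repeated eigenvalues
    # this can include coefficients at eigenvalues equal to lam[j], so the
    # criterion is applied slightly conservatively.
    tail = total - np.cumsum(energy)
    idx = int(np.argmax(tail <= tol * total))
    return lam[idx]
```

A simple check is that a signal equal to a single eigenvector of ${\bf L}$ has empirical bandwidth equal to the corresponding eigenvalue.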

The procedure above is repeated 100 times and the means of $\omega({\bf 1}_{S_{i}})$ and $\omega({\bf 1}_{A})$ are compared with $\sup_{{\bf s}\in\partial S_{i}}p({\bf s})$ and $\sup_{{\bf x}\in\partial A}p({\bf x})$ respectively. The result is plotted in Figure 5. We observe that the empirical bandwidth is close to the theoretically predicted value and has a very low standard deviation. This supports our conjecture that stochastic convergence should hold for the bandwidth. To further justify this claim, we study the behavior of the standard deviation of $\omega({\bf 1}_{S_{i}})$ and $\omega({\bf 1}_{A})$ as a function of $n$ in Figure 6, where we observe a decreasing trend consistent with our result.

For our second experiment, we validate the label complexity of sampling theory-based learning in Conjecture 2 by reconstructing the indicator function corresponding to $\partial S_{3}$ and $\partial A$ from a fraction of labeled examples on the corresponding graphs. This is carried out as follows: for a given budget $l$ , we find a set $L\subset\{1,2,\dots,n\}$ of $|L|=l$ points to label, using pivoted column-wise Gaussian elimination on the eigenvector matrix ${\bf U}$ of ${\bf L}$ [15]. This method ensures that the obtained labeled set guarantees perfect recovery for signals spanned by the first $l$ eigenvectors of ${\bf L}$ [15]. We then recover the indicator functions from these labeled sets by solving the least squares problem in (5) followed by thresholding. Note that $\theta$ is set to the cutoff frequency $\omega_{c}(L)$ of $L$ , which is equal to the $l^{\text{th}}$ eigenvalue of ${\bf L}$ . The mean reconstruction error is defined as

[TABLE]

We repeat the experiment $100$ times by generating different graphs and plot the averaged $E_{\rm mean}$ against the fraction of labeled examples. The result is illustrated in Figure 7. We observe that the error goes to zero as the fraction of labeled points goes beyond the respective limit values stated in (25) and (26). This reinforces the intuition that the bandwidth of class indicators and their label complexities are closely linked with the inherent geometry of the data.
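The labeling and reconstruction pipeline in this experiment can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of [15] or the released code: `select_sampling_set` is our own greedy pivoted elimination on the first $l$ eigenvectors, and `bandlimited_reconstruct` is a plain least squares fit in their span, used here as a stand-in for the problem in (5).

```python
import numpy as np

def select_sampling_set(U_l):
    """Pick one node per eigenvector by column-wise Gaussian elimination with
    pivoting on the n x l matrix U_l (a sketch; details of [15] may differ)."""
    A = np.array(U_l, dtype=float)
    chosen = []
    for k in range(A.shape[1]):
        p = int(np.argmax(np.abs(A[:, k])))      # pivot row for column k
        chosen.append(p)
        # Zero out row p in the remaining columns using column operations.
        A[:, k + 1:] -= np.outer(A[:, k], A[p, k + 1:]) / A[p, k]
    return chosen

def bandlimited_reconstruct(U_l, chosen, labels):
    """Least squares fit of the labels in the span of U_l, evaluated on all nodes."""
    coeffs, *_ = np.linalg.lstsq(U_l[chosen, :], labels, rcond=None)
    return U_l @ coeffs

# Illustrative random graph (not data from the paper).
rng = np.random.default_rng(3)
n, l = 20, 5
A0 = rng.uniform(size=(n, n))
W = (A0 + A0.T) / 2
L = (np.diag(W.sum(axis=1)) - W) / n
_, U = np.linalg.eigh(L)
U_l = U[:, :l]

f = U_l @ rng.normal(size=l)                     # a signal that is exactly bandlimited
chosen = select_sampling_set(U_l)
f_hat = bandlimited_reconstruct(U_l, chosen, f[chosen])
```

For a signal lying exactly in the span of the first $l$ eigenvectors, `f_hat` matches `f` up to floating-point error. For class indicators, which are only approximately bandlimited, the real-valued output would additionally be thresholded (for instance at $0.5$ for $0/1$ indicators, which is an assumption here).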

VII Discussions and future work

In this paper, we provided an interpretation of the graph sampling theoretic approach to semi-supervised learning. Our work analyzed the bandwidth of class indicator signals with respect to the Laplacian eigenvector basis and revealed its connection to the underlying geometry of the dataset. This connection is useful in justifying graph-based approaches for semi-supervised and unsupervised learning problems, and provides a geometrical interpretation of the smoothness assumptions imposed in the bandlimited reconstruction approach. Specifically, our results have shown that an estimate of the bandwidth of class indicators converges to the supremum of the probability density on the class boundaries for the separable model, and on the overlap regions for the nonseparable model. This quantifies the connection between the assumptions of smoothness (in terms of bandlimitedness) and low density separation, since boundaries passing through regions of low data density result in lower bandwidth of the class indicator signals. We numerically validated these results through various experiments.

There are several directions in which our results can be extended. In this paper we only considered Gaussian-weighted graphs; an immediate extension would be to consider arbitrary kernel functions for computing graph weights, or density dependent edge-connections such as $k$ -nearest neighbors. Another possibility is to consider data defined on a subset of the $d$ -dimensional Euclidean space.

Our analysis also sheds light on the label complexity of graph-based semi-supervised learning problems. We showed that perfect prediction from a few labeled examples using a graph-based bandlimited interpolation approach requires the same amount of labeling as one would need to completely encompass the boundary or region of ambiguity. This quantifies the connection between label complexity of a sampling theory-based approach with the underlying geometry of the problem. We believe that the main potential of graph-based methods will be apparent in situations where one can tolerate a certain amount of prediction error, in which case such approaches shall require fewer labeled data. We plan to investigate this as part of future work.

Appendix A Proof of Lemma 5

The key ingredient required for evaluating the integrals in Lemma 5 involves selecting a radius $R$ ( $<\tau$ ) as a function of $\sigma$ that satisfies the following properties as $\sigma\rightarrow 0$ :

1. $R\rightarrow 0$ ,

2. $R/\sigma\rightarrow\infty$ ,

3. $R^{2}/\sigma\rightarrow 0$ ,

4. $\epsilon_{R}/\sigma\rightarrow 0$ , where $\epsilon_{R}:=\int_{\|{\bf z}\|>R}K_{\sigma^{2}}({\bf 0},{\bf z})d{\bf z}$ .

A particular choice of $R$ is given by $R=\sqrt{d\sigma^{2}\log{(1/\sigma^{2})}}$ . Note that $R\rightarrow 0$ as $\sigma\rightarrow 0$ . Further,

[TABLE]

Hence, $R/\sigma\rightarrow\infty$ and $R^{2}/\sigma\rightarrow 0$ as $\sigma\rightarrow 0$ . Additionally, substituting the expression for $R$ in the tail bound for the norm of a $d$ -dimensional Gaussian vector gives us:

[TABLE]

Therefore, for $d>1$ , $\epsilon_{R}/\sigma\rightarrow 0$ as $\sigma\rightarrow 0$ . Further, it is easy to ensure $R<\tau$ for the regime of $\sigma$ in our proofs.
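The four properties of this choice of $R$ can be illustrated numerically. The sketch below assumes that $K_{\sigma^{2}}$ is a normalized Gaussian kernel with per-coordinate variance $\sigma^{2}$ , and it evaluates the tail mass only for $d=2$ , where $\|{\bf z}\|^{2}/\sigma^{2}$ is chi-square with two degrees of freedom and the tail has the closed form $\exp(-R^{2}/2\sigma^{2})$ . The function names are hypothetical.

```python
import math

def radius(sigma, d):
    """R = sqrt(d * sigma^2 * log(1 / sigma^2))."""
    return math.sqrt(d * sigma ** 2 * math.log(1.0 / sigma ** 2))

def tail_mass_2d(sigma):
    """epsilon_R for d = 2 under a normalized Gaussian kernel:
    P(||z|| > R) = exp(-R^2 / (2 sigma^2))."""
    R = radius(sigma, 2)
    return math.exp(-R ** 2 / (2 * sigma ** 2))

sigmas = [1e-1, 1e-2, 1e-3]
R_vals = [radius(s, 2) for s in sigmas]                   # should decrease to 0
R_over_sigma = [R / s for R, s in zip(R_vals, sigmas)]    # should grow
R2_over_sigma = [R * R / s for R, s in zip(R_vals, sigmas)]  # should decrease to 0
eps_over_sigma = [tail_mass_2d(s) / s for s in sigmas]    # should decrease to 0
```

With $d=2$ the tail mass simplifies to $\epsilon_{R}=\sigma^{2}$ , so that $\epsilon_{R}/\sigma=\sigma\rightarrow 0$ , consistent with the fourth property.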

We now consider the proof of equation (50). Let

[TABLE]

Further, let $[S_{1}]_{R}$ indicate a tubular region of thickness $R$ adjacent to the boundary $\partial S$ in $S_{1}$ , i.e., the set of points in $S_{1}$ at a distance $\leq R$ from the boundary. Then, we have

[TABLE]

$E_{1}$ is the error associated with approximating $I$ by $I_{1}$ and exhibits the following behavior:

Lemma 10.

$\lim_{\sigma\rightarrow 0}E_{1}=0$ .

Proof.

Note that

[TABLE]

Using $\lim_{\sigma\rightarrow 0}\epsilon_{R}/\sigma=0$ , we get the desired result. ∎

In order to analyze $I_{1}$ , we need to define certain geometrical constructions (illustrated in Figure 8) as follows:

Definition 1.

1. For each ${\bf x}_{1}\in[S_{1}]_{R}$ , we define a transformation of coordinates as:

[TABLE]

where ${\bf s}_{1}$ is the foot of the perpendicular dropped from ${\bf x}_{1}$ onto $\partial S$ , $r_{1}$ is the distance between ${\bf s}_{1}$ and ${\bf x}_{1}$ , and ${\bf n}({\bf s}_{1})$ is the surface normal at ${\bf s}_{1}$ (towards the direction of ${\bf x}_{1}$ ). Since the minimum radius of curvature of $\partial S$ is $\tau$ and $R<\tau$ , this mapping is injective.

2. For each ${\bf s}_{1}\in\partial S$ , let $H_{{\bf s}_{1}}^{+}$ denote the half-space created by the plane tangent on ${\bf s}_{1}$ and on the side of $S_{2}$ . Similarly, let $H_{{\bf s}_{1}}^{-}$ denote the half-space on the side of $S_{1}$ , that is, $H_{{\bf s}_{1}}^{-}=\mathbb{R}^{d}\setminus H_{{\bf s}_{1}}^{+}$ .

3. Let $W_{{\bf s}_{1}}^{+}(x)$ denote an infinite slab of thickness $x$ tangent to $\partial S$ at ${\bf s}_{1}$ and towards the side of $S_{2}$ . Let $W_{{\bf s}_{1}}^{-}(y)$ denote a similar slab of thickness $y$ on the side of $S_{1}$ .

4. Finally, for any ${\bf x}$ , let $B({\bf x},R)$ denote the Euclidean ball of radius $R$ centered at ${\bf x}$ .

We now consider $I_{1}$ . The main idea here is to approximate the integral over $S_{2}$ by an integral over the half-space $H^{+}_{{\bf s}_{1}}$ . Hence, we have:

[TABLE]

where $E_{2}$ is the error associated with the approximation. Therefore, we have

[TABLE]

We now show that as $\sigma\rightarrow 0$ , $I_{2}\rightarrow\frac{1}{\sqrt{2\pi}}\int_{\partial S}p^{\alpha+\beta}({\bf s})d{\bf s}$ , and $E_{2}\rightarrow 0$ .

Lemma 11.

$\lim_{\sigma\rightarrow 0}I_{2}=\frac{1}{\sqrt{2\pi}}\int_{\partial S}p^{\alpha+\beta}({\bf s})d{\bf s}$ .

Proof.

Using the change of coordinates ${\bf x}_{1}={\bf s}_{1}+r_{1}{\bf n}({\bf s}_{1})$ , we have

[TABLE]

where $J({\bf s}_{1},r_{1})$ denotes the Jacobian of the transformation. Now, an arc $\widehat{PQ}$ of length $ds$ at a distance $r_{1}$ away from $\partial S$ gets mapped to an arc $\widehat{P^{\prime}Q^{\prime}}$ on $\partial S$ whose length lies in the interval $[ds(1-\frac{r_{1}}{\tau}),ds(1+\frac{r_{1}}{\tau})]$ . Therefore, for all points within $[S_{1}]_{R}$ , we have

[TABLE]

Further, since $p({\bf x})$ is Lipschitz continuous with constant $L_{p}$ , $p^{\alpha}({\bf x})$ is also Lipschitz continuous with constant $L_{p,\alpha}$ . Therefore, for any ${\bf x}_{1}\in[S_{1}]_{R}$ , we have $|p^{\alpha}({\bf x}_{1})-p^{\alpha}({\bf s}_{1})|\leq L_{p,\alpha}R$ . This leads to the following simplification for $I_{2}$ :

[TABLE]

where we defined

[TABLE]

Note that every ${\bf x}_{2}\in H_{{\bf s}_{1}}^{+}$ can be written as ${\bf s}_{2}+r_{2}{\bf n}({\bf s}_{2})$ , where ${\bf n}({\bf s}_{2})=-{\bf n}({\bf s}_{1})$ . Hence, we get

[TABLE]

where we used Lipschitz continuity of $p^{\beta}({\bf x})$ in the second equality and applied Lemma 3 to arrive at the last step. Further, using the definition of the $Q$ -function and integration by parts, we note that

[TABLE]

Therefore,

[TABLE]

Combining (95) and (98) and using the fact that $R/\sigma\rightarrow\infty$ as $\sigma\rightarrow 0$ (from the definition of $R$ ), we get

[TABLE]

which concludes the proof. ∎
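The constant $\frac{1}{\sqrt{2\pi}}$ in Lemma 11 is consistent with the standard identity $\int_{0}^{\infty}Q(u)du=\frac{1}{\sqrt{2\pi}}$ , which follows from integration by parts: $\int_{0}^{a}Q(u)du=aQ(a)+\frac{1}{\sqrt{2\pi}}(1-e^{-a^{2}/2})$ . The sketch below checks this identity numerically; it illustrates the kind of $Q$ -function integral used in the proof and is not a transcription of the paper's equations. Function names are our own.

```python
import math

def Q(u):
    """Gaussian tail function Q(u) = P(Z > u) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(u / math.sqrt(2))

def integral_Q(a, steps=20000):
    """Midpoint-rule approximation of the integral of Q over [0, a]."""
    h = a / steps
    return h * sum(Q((i + 0.5) * h) for i in range(steps))

def closed_form(a):
    """Integration by parts: a Q(a) + (1 - exp(-a^2 / 2)) / sqrt(2 pi)."""
    return a * Q(a) + (1 - math.exp(-a * a / 2)) / math.sqrt(2 * math.pi)

numeric = integral_Q(3.0)
analytic = closed_form(3.0)
limit_value = closed_form(40.0)   # upper limit large enough to stand in for infinity
```

As the upper limit grows (which corresponds to $R/\sigma\rightarrow\infty$ ), the closed form approaches $\frac{1}{\sqrt{2\pi}}$ .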

We now consider the error term $E_{2}$ and prove the following result:

Lemma 12.

$\lim_{\sigma\rightarrow 0}E_{2}=0$ .

Proof.

Let us first rewrite $E_{2}$ as follows:

[TABLE]

where we defined

[TABLE]

The key idea is to lower and upper bound $I_{4}({\bf x}_{1})$ for all ${\bf x}_{1}$ using worst case scenarios and evaluate the limits of the bounds. Note that $I_{4}({\bf x}_{1})$ is largest in magnitude when $S_{1}$ or $S_{2}$ is a sphere of radius $\tau$ , as illustrated in Figures 9(a) and 9(b). We now make certain geometrical observations. For any ${\bf x}_{1}={\bf s}_{1}+r_{1}{\bf n}({\bf s}_{1})\in[S_{1}]_{R}$ , we observe from Figure 9(b) that

[TABLE]

where $R^{\prime}=\frac{R^{2}}{2(\tau-R)}$ . Similarly, from Figure 9(a), we observe that

[TABLE]

Substituting these in (100) and using a simplification similar to that of $I_{2}$ in (95), we get

[TABLE]

where we defined

[TABLE]

Similar to the evaluation of $I_{3}({\bf s}_{1})$ in (97), we have

[TABLE]

We now evaluate the two 1-D integrals as follows:

[TABLE]

Similarly,

[TABLE]

Noting that as $\sigma\rightarrow 0$ , $R/\sigma\rightarrow\infty$ and $R^{\prime}/\sigma\rightarrow 0$ , we conclude that $\lim_{\sigma\rightarrow 0}E_{2}=0$ . ∎

The proof of (51) proceeds in a similar fashion by approximating the inner integral using hyperplanes. Specifically, similar to the proof of (50), we can show that the integral on the left hand side can be written as $I+E$ , where

[TABLE]

and $E$ is the residual associated with the approximation that can be shown to go to zero as $\sigma\rightarrow 0$ (we skip this proof since it is quite similar to the analysis for (50)). In order to evaluate $I$ , we perform a change of coordinates ${\bf x}_{1}={\bf s}_{1}+r_{1}{\bf n}({\bf s}_{1})$ as before to obtain

[TABLE]

where we defined

[TABLE]

By using a change of coordinates for ${\bf x}_{2}$ similar to the steps in (97), we obtain

[TABLE]

The 1-D integrals can be evaluated as follows:

[TABLE]

Using the fact that $\alpha+\beta=\alpha^{\prime}+\beta^{\prime}=\gamma$ , and taking the limit $\sigma\rightarrow 0$ after putting everything together, we conclude

[TABLE]

Bibliography

[1] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
[2] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16 (S. Thrun, L. K. Saul, and B. Schölkopf, eds.), pp. 321–328, MIT Press, 2004.
[3] H. Narayanan, M. Belkin, and P. Niyogi, “On the relation between low density separation, spectral clustering and graph cuts,” in Advances in Neural Information Processing Systems (NIPS) 19, 2006.
[4] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in ICML, pp. 912–919, 2003.
[5] N. García Trillos and D. Slepčev, “Continuum limit of total variation on point clouds,” Archive for Rational Mechanics and Analysis, vol. 220, pp. 193–241, Apr 2016.
[6] X. Zhou and M. Belkin, “Semi-supervised learning by higher order regularization,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pp. 892–900, 2011.
[7] T. Bühler and M. Hein, “Spectral clustering based on the graph p-laplacian,” in Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, (New York, NY, USA), pp. 81–88, ACM, 2009.
[8] A. E. Alaoui, “Asymptotic behavior of $\ell_{p}$ -based Laplacian regularization in semi-supervised learning,” in Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pp. 879–906, 2016.
