Clustering with t-SNE, provably

George C. Linderman; Stefan Steinerberger

arXiv:1706.02582·cs.LG·June 9, 2017

TL;DR

This paper provides a mathematical proof that t-SNE can reliably recover well-separated clusters during its early exaggeration phase, enhancing understanding and guiding parameter choices.

Contribution

It offers the first rigorous analysis of t-SNE's cluster recovery capability and proposes new guidelines for setting key parameters based on the proof.

Findings

01. t-SNE can recover well-separated clusters during early exaggeration

02. New parameter setting rules improve embedding quality

03. Connection established between t-SNE and spectral clustering methods

Abstract

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the `early exaggeration' phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter $α$ and step size $h$ . Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g.…

Equations

$$C(\mathcal{Y})=KL(P\,\|\,Q)=\sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}$$

$$\frac{\partial C}{\partial y_{i}}=4\sum_{j\neq i}(p_{ij}-q_{ij})q_{ij}Z(y_{i}-y_{j})$$

$$y_{i}(t+1)=y_{i}(t)-h\frac{\partial C}{\partial y_{i}(t)}$$

$$\alpha\sim\frac{n}{10}\quad\mbox{and}\quad h\sim 1$$

$$\kappa\sim 1-\frac{\alpha h}{n}$$

$$p_{i|j}=\frac{\exp\left(-\|x_{i}-x_{j}\|^{2}/2\sigma_{i}^{2}\right)}{\sum_{k\neq i}\exp\left(-\|x_{i}-x_{k}\|^{2}/2\sigma_{i}^{2}\right)}\quad\mbox{and}\quad p_{ij}=\frac{p_{i|j}+p_{j|i}}{2n}$$

$$q_{ij}=\frac{(1+\|y_{i}-y_{j}\|^{2})^{-1}}{\sum_{k\neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}}$$

$$Z=\sum_{k\neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}$$

$$\frac{1}{4}\frac{\partial C}{\partial y_{i}}=\sum_{j\neq i}p_{ij}q_{ij}Z(y_{i}-y_{j})-\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j})$$

$$\frac{1}{4}\frac{\partial C}{\partial y_{i}}=\sum_{j\neq i}\alpha p_{ij}q_{ij}Z(y_{i}-y_{j})-\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j})$$

$$\frac{h}{4}\frac{\partial C}{\partial y_{i}}=h\sum_{j\neq i}\alpha p_{ij}q_{ij}Z(y_{i}-y_{j})-h\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j})$$

$$p_{ij}\geq\frac{1}{10n\,|\pi^{-1}(\pi(i))|}$$

$$\sum_{\substack{j\neq i\\ \mbox{same cluster}}}:=\sum_{\substack{j\neq i\\ \pi(j)=\pi(i)}}\quad\mbox{and}\quad\sum_{\substack{j\neq i\\ \mbox{other clusters}}}:=\sum_{\substack{j\neq i\\ \pi(j)\neq\pi(i)}}$$

$$\frac{1}{100}\leq\alpha h\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\leq\frac{9}{10}$$

$$\alpha h=\frac{9}{10}\Bigg(\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\Bigg)^{-1}\quad\mbox{while}\quad\alpha h=\frac{9}{10}\Bigg(\max_{1\leq i\leq n}\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\Bigg)^{-1}$$

$$\mbox{diam}\left\{y_{j}:1\leq j\leq n\wedge\pi(j)=\pi(i)\right\}\leq c\cdot h\Bigg(\alpha\sum_{\substack{j\neq i\\ \mbox{other clusters}}}p_{ij}+\frac{1}{n}\Bigg)$$

$$\sum_{\substack{j\neq i\\ \mbox{other clusters}}}p_{ij}\leq\frac{c_{2}}{n}$$

$$p_{ij}\geq\frac{1}{10}\frac{1}{|\pi^{-1}(\pi(i))|}\quad\mbox{and}\quad\sum_{j=1}^{n}p_{ij}\leq 1$$

$$y_{i}(t+1)=\sum_{j\neq i}p_{ij}y_{j}(t)+\Bigg(1-\sum_{j\neq i}p_{ij}\Bigg)y_{i}(t)$$

$$A_{ij}=\begin{cases}1-\sum_{k\neq i}p_{ik}&\mbox{if }i=j\\ p_{ij}&\mbox{otherwise}\end{cases}$$

$$|\alpha_{i,j,t}|\geq\delta>0$$

$$\sum_{j=1}^{n}\alpha_{i,j,t}\leq 1$$

$$\|\varepsilon_{i}(t)\|\leq\varepsilon$$

$$\mbox{conv}\left\{z_{1}(t+1),z_{2}(t+1),\dots,z_{n}(t+1)\right\}\subseteq\mbox{conv}\left\{z_{1}(t),z_{2}(t),\dots,z_{n}(t)\right\}+B(0,\varepsilon)$$

Full text

Clustering with t-SNE, provably.

George C. Linderman

Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA

[email protected]

and

Stefan Steinerberger

Department of Mathematics, Yale University, New Haven, CT 06511, USA

[email protected]

Abstract.

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the ‘early exaggeration’ phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter $\alpha$ and step size $h$ . Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.

GCL was supported by NIH grant #1R01HG008383-01A1 (PI: Yuval Kluger) and U.S. NIH MSTP Training Grant T32GM007205. SS was partially supported by #INO15-00038 (Institute of New Economic Thinking).

1. Introduction and main result

1.1. Introduction.

The analysis of large, high dimensional datasets is ubiquitous in an increasing number of fields and vital to their progress. Traditional approaches to data analysis and visualization often fail in the high dimensional setting, and it is common to perform dimensionality reduction in order to make data analysis tractable. t-distributed Stochastic Neighborhood Embedding (t-SNE), introduced by van der Maaten and Hinton (2008), is an impressively effective non-linear dimensionality reduction technique that has recently found enormous popularity in several fields. It is most commonly used to produce a two-dimensional embedding of high dimensional data with the goal of simplifying the identification of clusters. Despite its tremendous empirical success, the theory underlying t-SNE is unclear. The only theoretical paper at this point is Shaham and Steinerberger (2017), which shows that the structure of the loss functional of SNE (a precursor to t-SNE) implies that global minimizers separate clusters in a quantitative sense.

1.2. A case study.

As an unsupervised learning method, t-SNE is commonly used to visualize high dimensional data and provide crucial intuition in settings where ground truth is unknown. The analysis of single cell RNA sequencing (scRNA-seq) data, where t-SNE has become an integral part of the standard analysis pipeline, provides a relevant example of its usage.

Figure 1 shows (left) the output of running t-SNE on the 30 largest principal components of the normalized expression matrix of 49300 retinal cells taken from Macosko et al. (2015). The output on the right has cells colored based on which of 12 cell type marker genes were most expressed (with grey signifying that none of the marker genes were expressed). This example is well suited to showcase both the tremendous impact of t-SNE in the medical sciences as well as the inherent difficulties of interpreting its output when ground truth is unknown: how many clusters are in the original space, and do they correspond one-to-one to clusters in the t-SNE plot? Do the clusters (e.g. the largest cluster that does not express any marker genes) have substructure that is not apparent in this visualization? Pre-processing steps will yield different embeddings; how stable are the clusters? All these questions are of the utmost importance and underline the need for a better theoretical understanding.

1.3. Early Exaggeration

t-SNE (described in greater detail in §3) minimizes the Kullback-Leibler divergence between a Gaussian distribution modeling distances between points in the high dimensional input space and a Student t-distribution modeling distances between corresponding points in a low dimensional embedding. Given a $d$ -dimensional input dataset $\mathcal{X}=\{x_{1},x_{2},...,x_{n}\}\subset\mathbb{R}^{d}$ , t-SNE computes an $s$ -dimensional embedding of the points in $\mathcal{X}$ , denoted by $\mathcal{Y}=\{y_{1},y_{2},...,y_{n}\}\subset\mathbb{R}^{s}$ , where $s\ll d$ and most commonly $s=2\text{ or }3$ . The main idea is to define a series of affinities $p_{ij}$ on $\mathcal{X}$ as well as a series of affinities $q_{ij}$ in the embedding $\mathcal{Y}$ and then try to minimize the distance between these distributions as measured by the Kullback-Leibler divergence

$$C(\mathcal{Y})=KL(P\,\|\,Q)=\sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}},$$

which gives rise to a gradient descent method via

$$\frac{\partial C}{\partial y_{i}}=4\sum_{j\neq i}(p_{ij}-q_{ij})q_{ij}Z(y_{i}-y_{j}).$$

One difficulty is that the convergence rate slows down as the number of points $n$ increases. However, van der Maaten and Hinton (2008) already propose a number of ways in which the convergence can be accelerated.

A less obvious way to improve the optimization, which we call ‘early exaggeration’, is to multiply all of the $p_{ij}$ ’s by, for example, 4, in the initial stages of the optimization. […] In all the visualizations presented in this paper and in the supporting material, we used exactly the same optimization procedure. We used the early exaggeration method with an exaggeration of 4 for the first 50 iterations (note that early exaggeration is not included in the pseudocode in Algorithm 1). (from: van der Maaten and Hinton (2008))

It is easy to test empirically that this renormalization indeed improves the clustering and is effective. It has become completely standard and is hard-coded into the very widely used standard implementation available online, as described by van der Maaten (2014):

During the first 250 learning iterations, we multiplied all $p_{ij}$-values by a user-defined constant $\alpha>1$ . […] this trick enables t-SNE to find a better global structure in the early stages of the optimization by creating very tight clusters of points that can easily move around in the embedding space. In preliminary experiments, we found that this trick becomes increasingly important to obtain good embeddings when the data set size increases [Emphasis GL & SS], as it becomes harder for the optimization to find a good global structure when there are more points in the embedding because there is less space for clusters to move around. In our experiments, we fix $\alpha=12$ (by contrast, van der Maaten and Hinton (2008) used $\alpha=4$ ). (from: van der Maaten (2014))

As it turns out, this simple optimization trick can be rigorously analyzed. As a byproduct of our analysis, we see that the convergence of non-accelerated t-SNE will slow down as the number of points $n$ increases and the number of iterations required will grow at least linearly in $n$ . The implementation available online counteracts this problem by various methods: (1) the early exaggeration factor $\alpha$ , (2) a large ( $h=200$ ) stepsize in the gradient descent

$$y_{i}(t+1)=y_{i}(t)-h\frac{\partial C}{\partial y_{i}(t)},$$

and by (3) optimization techniques such as momentum. We only deal with the t-SNE algorithm, the early exaggeration factor $\alpha$ and the step-size $h$ ; one of the main points of our paper is that a suitable parameter selection of $\alpha$ and $h$ makes it possible to guarantee fast convergence without additional optimization techniques.

1.4. Summary of Main Results.

We will now state our main results at an informal level; all the statements can be made precise (this is done in §3.2) and will be rigorously proven.

(1) Canonical parameters and exponential convergence. There is a canonical setting for the parameters $\alpha,h$ for which the algorithm applied to clustered data converges provably at an exponential rate without the use of other optimization techniques (such as momentum). This setting is

$$\alpha\sim\frac{n}{10}\quad\mbox{and}\quad h\sim 1.$$

These parameters lead to an exponential convergence of all embedded clusters to small balls (whose diameter depends on how well $\mathcal{X}$ is clustered). Generally, the speed of convergence is exponential with an exponential factor $\kappa$

$$\kappa\sim 1-\frac{\alpha h}{n}.$$

Moreover, if $\alpha h\gtrsim n$ , then convergence of the algorithm breaks down. This theoretical result is actually applicable to the early exaggeration phase of the classical t-SNE implementation as long as the number of points is not too large (roughly, $n\lesssim 20000$ ).

(2) Spectral clustering. The t-SNE algorithm, in this regime, behaves like a spectral clustering algorithm; moreover, this algorithm can be written down explicitly. This allows for (1) the use of theory from spectral clustering to rigorously analyze t-SNE and (2) a fast implementation that can perform the early exaggeration phase in a fraction of the time necessary to run t-SNE (in this regime). It also poses the challenge of trying to understand whether t-SNE behaves qualitatively differently for the standard parameters $\alpha\sim 12,h\sim 200$ or whether it behaves more or less identically (and thus like a spectral method).

(3) Disjoint clusters. It is not guaranteed that the embedded clusters in $\mathcal{Y}$ are disjoint; but given a random initialization, it is extremely unlikely that two distinct clusters will converge to the same center. Furthermore, if $\mathcal{X}$ is well-clustered, the diameter of the clusters in $\mathcal{Y}$ can be made even smaller by decreasing the step-size $h$ and further increasing $\alpha$ as long as the product satisfies $\alpha h\sim n/10$ . Increasing $\alpha$ will resolve overlapping clusters, as long as they have different centers. In particular, the number of disjoint clusters in $\mathcal{Y}$ is a lower bound on the number of clusters in $\mathcal{X}$ (and, generically, the numbers coincide).

(4) Independence of initialization. All these results are independent of the initialization of $\mathcal{Y}$ as long as it is contained in a sufficiently small ball.

An immediate implication of (3) is the following: if we are given some clustered data $\mathcal{X}$ and see that the embedding of t-SNE for large values of $\alpha$ (and small values of $h$ ) produces $k$ disjoint clusters, then there are at least $k$ clusters in $\mathcal{X}$ (and, generically, exactly $k$ ). The results guarantee that all clusters in $\mathcal{X}$ are eventually mapped to small balls which can be made arbitrarily small. We see that when parameters are chosen optimally, this result provides a justification for the way t-SNE is commonly used in, say, biomedical research.
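To get a feel for result (1), the following minimal sketch reads the heuristic $\kappa\sim 1-\alpha h/n$ as an exact equality (an assumption made purely for illustration; the paper only states it up to constants) and counts how many iterations are needed to shrink a cluster diameter by a fixed factor. The function names `contraction_factor` and `iterations_to_shrink` are hypothetical helpers, not part of any t-SNE implementation.

```python
import math

def contraction_factor(alpha, h, n):
    """Heuristic rate kappa = 1 - alpha*h/n ('~' read as equality: an assumption)."""
    return 1.0 - alpha * h / n

def iterations_to_shrink(alpha, h, n, factor=1e-3):
    """Smallest t with kappa**t <= factor, or None if alpha*h >= n (no contraction)."""
    kappa = contraction_factor(alpha, h, n)
    if kappa <= 0.0:
        return None
    return math.ceil(math.log(factor) / math.log(kappa))

for n in (5_000, 50_000, 500_000):
    standard = iterations_to_shrink(12, 200, n)     # classical alpha = 12, h = 200
    canonical = iterations_to_shrink(n / 10, 1, n)  # alpha = n/10, h = 1
    print(n, standard, canonical)
```

Under this reading, the count for $\alpha=n/10,h=1$ does not depend on $n$, while the count for fixed $\alpha=12,h=200$ grows roughly linearly in $n$ once $\alpha h/n$ is small, which is the slowdown described above.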

1.5. Approximating Spectral Clustering.

The fact that t-SNE approximates a spectral clustering method for $\alpha\sim n/10,~h\sim 1$ raises a fascinating question: does t-SNE, in its early exaggeration phase, perform better with the classical parameter choices of $\alpha\sim 12,h\sim 200$ than it does with $\alpha\sim n/10,~h\sim 1$ ? If yes, then its inner workings may give rise to improved spectral methods. If no, then it would be advantageous to use $\alpha\sim n/10,~h\sim 1$ ; in that regime, however, t-SNE is essentially a spectral method, and it may be advantageous (and much faster) to initialize the second phase of t-SNE with the outcome of a more advanced spectral method instead. We discuss some experiments in that direction in §4.1 and believe this to be worthy of further investigation. Moreover, we describe a visualization technique in the style of t-SNE for spectral clustering tools (see §4.2).

1.6. Organization.

This paper is organized as follows: we first illustrate our main points with some numerical examples in §2. §3 establishes notation and gives a formal statement of our main result, §4 derives a connection between t-SNE and spectral clustering, §5 discusses a certain type of discrete dynamical system on finite numbers of points and establishes a crucial estimate, and §6 gives a proof of the main result.

2. Numerical examples

This section discusses a number of numerical examples to illustrate our main points.

2.1. Lines and Swiss roll.

It is classical that t-SNE does not successfully embed the swiss roll; however, the random initialization causes difficulty even on simpler data: Figure 2 shows the t-SNE embedding (using the Matlab implementation of van der Maaten (2014) with default parameters) of a simple line in $\mathbb{R}^{3}$ .

The randomized initialization causes, after initial contraction in the early exaggeration phase, a topological interlocking that cannot be further resolved. The example is even more striking with the swiss roll, where the random initialization leads to ‘knots’ that cannot be untied by t-SNE. In stark contrast, the parameter selection

$$\alpha\sim\frac{n}{10}\quad\mbox{and}\quad h\sim 1$$

allows for a more effective early exaggeration phase that clearly recovers the line from random initial data and even contracts the swiss roll to a correctly ordered line (that would then expand in the second phase of the algorithm).

The successful embedding of these examples when $\alpha$ and $h$ are chosen optimally is consistent with our claim that in this regime, the early exaggeration phase of t-SNE acts like a spectral method, many of which also correctly embed these manifolds.

2.2. Real-life data.

Finally, we show the impact of the parameter selection $\alpha\sim n/10,h\sim 1$ in a real-life example. Figure 5 shows (left) classical out-of-the-box t-SNE on 10000 randomly subsampled handwritten digits (0–5) from the MNIST dataset as well as the outcome of the early exaggeration phase of t-SNE with parameters $\alpha\sim n/10,h\sim 1$ (middle) and the final outcome after the second phase of t-SNE has been initialized with the data shown in the middle (right). We see that early exaggeration does essentially all the clustering already and the second phase rearranges them.

We believe this example again hints at one of the fundamental questions that arises from the work in this paper: is the initial clustering done by standard t-SNE comparable to the initial clustering with the new parameter selection? If so, then the fact that the new parameter selection emulates a spectral clustering method (see §4) certainly suggests the option of initializing with other clustering methods as opposed to random initialization. Moreover, it would hint at the danger of using a spectral clustering method and t-SNE as a dual verification of clustering.

3. t-SNE: Notation and the Main result

3.1. t-SNE

We denote the $d$ -dimensional input dataset by $\mathcal{X}=\{x_{1},x_{2},...,x_{n}\}\subset\mathbb{R}^{d}$ . t-SNE computes an $s$ -dimensional embedding of the points in $\mathcal{X}$ , denoted by $\mathcal{Y}=\{y_{1},y_{2},...,y_{n}\}\subset\mathbb{R}^{s}$ , where $s\ll d$ and most commonly $s=2\text{ or }3$ . The joint probability $p_{ij}$ measuring the similarity between $x_{i}$ and $x_{j}$ is computed as:

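$$p_{i|j}=\frac{\exp\left(-\|x_{i}-x_{j}\|^{2}/2\sigma_{i}^{2}\right)}{\sum_{k\neq i}\exp\left(-\|x_{i}-x_{k}\|^{2}/2\sigma_{i}^{2}\right)}\quad\mbox{and}\quad p_{ij}=\frac{p_{i|j}+p_{j|i}}{2n}.$$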
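The construction of the input affinities can be sketched in a few lines. This is a minimal illustration that assumes a single shared bandwidth `sigma` for all points, which is a simplification; it does not reproduce the per-point, perplexity-based choice of $\sigma_{i}$. The function name `gaussian_affinities` is hypothetical.

```python
import math

def gaussian_affinities(X, sigma=1.0):
    """Symmetrized affinities p_ij = (cond[i][j] + cond[j][i]) / (2n).

    cond[i] is the Gaussian kernel centred at x_i, normalized over k != i.
    Assumption: one bandwidth sigma for every point (no perplexity search).
    """
    n = len(X)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    cond = []
    for i in range(n):
        w = [0.0 if k == i else math.exp(-sqdist(X[i], X[k]) / (2 * sigma ** 2))
             for k in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

# Two tight groups on a line: affinities within a group dominate.
X = [(0.0,), (0.1,), (5.0,), (5.1,)]
P = gaussian_affinities(X)
print(P[0][1] > P[0][2])
```

By construction the matrix is symmetric and its entries sum to 1, so it can be fed directly into the cost and gradient expressions of this section.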

The bandwidth of the Gaussian kernel, $\sigma_{i}$ , is often chosen such that the perplexity of $P_{i}$ matches a user defined value, where $P_{i}$ is the conditional distribution across all data points given $x_{i}$ . We will never deal with these issues: we will assume that the $p_{ij}$ are given and that they correspond to a well-clustered set $\mathcal{X}$ (in a precise sense defined below). In particular, we will not assume that they have been obtained using a Gaussian kernel. The similarity between points $y_{i}$ and $y_{j}$ in the low dimensional embedding is defined as:

$$q_{ij}=\frac{(1+\|y_{i}-y_{j}\|^{2})^{-1}}{\sum_{k\neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}}.$$

t-SNE finds the points $\{y_{1},\dots,y_{n}\}$ which minimize the Kullback-Leibler divergence between the joint distribution $P$ of points in the input space and the joint distribution $Q$ of points in the embedding space:

$$C(\mathcal{Y})=KL(P\,\|\,Q)=\sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}}.$$
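A small sketch may help make the embedding affinities and the Kullback-Leibler cost concrete. It assumes `P` is given as a nested list with zero diagonal and uses the convention that terms with $p_{ij}=0$ contribute nothing; the names `student_t_affinities` and `kl_cost` are hypothetical.

```python
import math

def student_t_affinities(Y):
    """Return (Q, Z) with q_ij = (1 + |y_i - y_j|^2)^(-1) / Z and q_ii = 0."""
    n = len(Y)
    W = [[0.0 if i == j
          else 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
          for j in range(n)] for i in range(n)]
    Z = sum(map(sum, W))  # global normalization over all pairs k != l
    return [[w / Z for w in row] for row in W], Z

def kl_cost(P, Q):
    """C(Y) = sum over i != j of p_ij * log(p_ij / q_ij); p_ij = 0 terms are skipped."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Q, Z = student_t_affinities(Y)
P_uniform = [[0.0 if i == j else 1.0 / 6 for j in range(3)] for i in range(3)]
print(kl_cost(P_uniform, Q))
```

The cost vanishes when $P=Q$ and is positive otherwise, which is the property gradient descent exploits.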

The points $\mathcal{Y}$ are initialized randomly, and the cost function $C(\mathcal{Y})$ is minimized using gradient descent. The gradient is derived in Appendix A of van der Maaten and Hinton (2008):

$$\frac{\partial C}{\partial y_{i}}=4\sum_{j\neq i}(p_{ij}-q_{ij})q_{ij}Z(y_{i}-y_{j}),$$

where $Z$ is a global normalization constant

$$Z=\sum_{k\neq l}(1+\|y_{k}-y_{l}\|^{2})^{-1}.$$

As in van der Maaten (2014), we split the gradient into two parts:

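$$\frac{1}{4}\frac{\partial C}{\partial y_{i}}=\sum_{j\neq i}p_{ij}q_{ij}Z(y_{i}-y_{j})-\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j})$$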

where the first and second sums correspond to the sum of all attractive forces and the sum of all repulsive forces, respectively. Early exaggeration introduces the coefficient $\alpha>1$ and corresponds to the gradient descent method

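$$\frac{1}{4}\frac{\partial C}{\partial y_{i}}=\sum_{j\neq i}\alpha p_{ij}q_{ij}Z(y_{i}-y_{j})-\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j})$$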

and a small step-size $h>0$ leads to the expression

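$$\frac{h}{4}\frac{\partial C}{\partial y_{i}}=h\sum_{j\neq i}\alpha p_{ij}q_{ij}Z(y_{i}-y_{j})-h\sum_{j\neq i}q_{ij}^{2}Z(y_{i}-y_{j}).$$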
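The exaggerated update can be sketched as a single plain gradient step, without momentum or any other acceleration. The sketch assumes the factor 4 of the gradient is absorbed into the step size, so that each point moves by $-h$ times ($\alpha$-scaled attraction minus repulsion); the function name `exaggerated_step` and the toy data are illustrative assumptions.

```python
def exaggerated_step(Y, P, alpha, h):
    """One update y_i <- y_i - h * (alpha * attraction_i - repulsion_i).

    Assumption: the constant 4 in the gradient is absorbed into h.
    """
    n, s = len(Y), len(Y[0])
    # w_ij = q_ij * Z = (1 + |y_i - y_j|^2)^(-1)
    W = [[0.0 if i == j
          else 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
          for j in range(n)] for i in range(n)]
    Z = sum(map(sum, W))
    new_Y = []
    for i in range(n):
        move = [0.0] * s
        for j in range(n):
            if j == i:
                continue
            # alpha * p_ij * q_ij * Z  minus  q_ij^2 * Z  (= w_ij^2 / Z)
            coeff = alpha * P[i][j] * W[i][j] - W[i][j] ** 2 / Z
            for d in range(s):
                move[d] += coeff * (Y[i][d] - Y[j][d])
        new_Y.append([Y[i][d] - h * move[d] for d in range(s)])
    return new_Y

# Toy data: two perfectly separated clusters of three points each (n = 6).
P = [[1.0 / 12 if i != j and i // 3 == j // 3 else 0.0 for j in range(6)]
     for i in range(6)]
Y0 = [[0.004, -0.002], [-0.007, 0.005], [0.001, 0.009],
      [-0.003, -0.008], [0.006, 0.003], [-0.009, -0.001]]
Y = Y0
for _ in range(10):
    Y = exaggerated_step(Y, P, alpha=3.0, h=1.0)
```

On this toy input the points of each cluster move toward one another within a few iterations, while the repulsive term slowly pushes the two groups apart.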

3.2. Main result

This section gives our main result. The method of proof is rather flexible and it is not difficult to obtain variations on the result under slightly different assumptions. The result is formally stated for a set of points $\left\{x_{1},\dots,x_{n}\right\}$ and a set of mutual affinities $p_{ij}.$ We will not assume that the $p_{ij}$ are obtained using the standard t-SNE normalizations but work at a full level of generality using a set of three assumptions. We note, and explain below, that for standard t-SNE the second assumption holds until the number of points exceeds, roughly, $n\sim 20000$ and the third assumption holds by design. The first assumption encapsulates our notion of clustered data.

1. $\mathcal{X}$ is clustered. We proceed by giving a very versatile definition of what it means to be a cluster; it is trivially applicable to things that clearly are not clusters; however, in those cases the error bound in the Theorem will not convey any information. Formally, we assume that there exists a $k\in\mathbb{N}$ (the number of clusters) and a map $\pi:\left\{1,\dots,n\right\}\rightarrow\left\{1,2,\dots,k\right\}$ assigning each point to one of the $k$ clusters such that the following property holds: if $\pi(i)=\pi(j),$ then

$$p_{ij}\geq\frac{1}{10n\,|\pi^{-1}(\pi(i))|}.$$

Observe that $|\pi^{-1}(\pi(i))|$ is merely the size of the cluster in which $i$ and $j$ lie. We will furthermore abbreviate, for fixed $1\leq i\leq n$ , summations over clusters as

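$$\sum_{\substack{j\neq i\\ \mbox{same cluster}}}:=\sum_{\substack{j\neq i\\ \pi(j)=\pi(i)}}\quad\mbox{and}\quad\sum_{\substack{j\neq i\\ \mbox{other clusters}}}:=\sum_{\substack{j\neq i\\ \pi(j)\neq\pi(i)}}$$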

2. Parameter choice. We assume that $\alpha$ and $h$ are chosen such that, for some $1\leq i\leq n$

$$\frac{1}{100}\leq\alpha h\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\leq\frac{9}{10}.$$

The main result is applicable to a single cluster (i.e. it is possible to guarantee that a single cluster converges even if the rest do not) and it can be applied to exactly those clusters satisfying this inequality. It is easy to see, both in the proof and in numerical experiments, that the upper bound is a necessary condition for the early exaggeration phase of t-SNE to work (more precisely, the upper bound 1 is necessary but we need a little bit of leeway in another part of the argument). We observe that condition (1) implies that $\alpha\sim n/10$ and $h\sim 1$ is admissible; however, other parameter choices (e.g. $\alpha\sim 10n,h\sim 1/100$ ) are equally valid. In particular, for a small number of points (roughly $n\lesssim 24000$ ), the standard t-SNE parameter selection $\alpha\sim 12,~h\sim 200$ does satisfy these bounds. If the number of points gets larger, the lower bound is violated: our main result can be easily extended to cover that case; however, the factor $\kappa$ with which exponential convergence occurs approaches 1 and convergence, while technically exponential, slows down. In particular, an analysis of how this condition acts in the proof motivates an accurate parameter selection rule.

Guideline. The best convergence rate for the cluster containing $y_{i}$ is attained when

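$$\alpha h=\frac{9}{10}\Bigg(\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\Bigg)^{-1}\quad\mbox{while}\quad\alpha h=\frac{9}{10}\Bigg(\max_{1\leq i\leq n}\sum_{\substack{j\neq i\\ \mbox{same cluster}}}p_{ij}\Bigg)^{-1}$$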

is the best selection to ensure that all clusters converge.
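The parameter condition and the guideline are easy to check numerically once the affinities and a cluster assignment are available. The sketch below assumes `P` is a nested list and `labels[i]` plays the role of $\pi(i)$; the helper names are hypothetical.

```python
def same_cluster_mass(P, labels):
    """s_i = sum of p_ij over j != i lying in the same cluster as i."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i and labels[j] == labels[i])
            for i in range(n)]

def admissible(P, labels, alpha, h):
    """For each i, test 1/100 <= alpha * h * s_i <= 9/10 (the second assumption)."""
    return [0.01 <= alpha * h * s <= 0.9 for s in same_cluster_mass(P, labels)]

def guideline_product(P, labels):
    """alpha * h = (9/10) / max_i s_i, the choice aimed at all clusters converging."""
    return 0.9 / max(same_cluster_mass(P, labels))

# Toy data: two perfectly separated clusters of three points each, sum of p_ij = 1.
labels = [0, 0, 0, 1, 1, 1]
P = [[1.0 / 12 if i != j and labels[i] == labels[j] else 0.0 for j in range(6)]
     for i in range(6)]
print(guideline_product(P, labels))
print(admissible(P, labels, alpha=3.0, h=1.0))
```

Only the product $\alpha h$ enters the check, which reflects the remark above that $\alpha\sim n/10,h\sim 1$ and $\alpha\sim 10n,h\sim 1/100$ are equally admissible.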

3. Localized initialization. The initialization satisfies $\mathcal{Y}\subset[-0.01,0.01]^{2}$ . This assumption is not crucial and could be easily modified at the cost of changing some other constants. The proof suggests that initializing at smaller scales might be beneficial on the level of constants.

Theorem.

The diameter of the embedded cluster $\left\{y_{j}:1\leq j\leq n\wedge\pi(j)=\pi(i)\right\}$ decays exponentially (at universal rate) until its diameter satisfies, for some universal constant $c>0$ ,

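$$\mbox{diam}\left\{y_{j}:1\leq j\leq n\wedge\pi(j)=\pi(i)\right\}\leq c\cdot h\Bigg(\alpha\sum_{\substack{j\neq i\\ \mbox{other clusters}}}p_{ij}+\frac{1}{n}\Bigg).$$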

Remarks.

(1) The Theorem can be applied to a single cluster; in particular, some clusters may contract to tiny balls while others do not contract at all.

(2) Since $\alpha h\sim n$ , we see that the bound is only nontrivial if, for some small constant $c_{2}>0$ ,

$$\sum_{\substack{j\neq i\\ \mbox{other clusters}}}p_{ij}\leq\frac{c_{2}}{n}.$$

Otherwise, it merely tells us that the elements of the clusters are contained in a ball of radius $\sim 1$ (as are all the other points). Generally, for well-clustered data, we would expect that sum to be very close to 0 which would yield a leading term error of $ch/n$ .

(3) The constant $c$ seems to be roughly on scale $c\sim 10$ for well-clustered data and slightly larger for data with worse clustering properties (in particular, for the classical t-SNE parameter selection, it would slowly increase with the number of points $n$ ). We believe this estimate to be conservative and consider the true value to be on a smaller order of magnitude; this question will be pursued in future work.
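As a worked instance of Remarks (2) and (3), consider the idealized case in which the affinities to other clusters vanish, and take $c=10$ , $h=1$ and $n=10^{4}$ as purely illustrative values (the value of $c$ is only the rough scale suggested above, not a proven constant):

```latex
\mathrm{diam} \;\lesssim\; \frac{c\,h}{n} \;=\; \frac{10 \cdot 1}{10^{4}} \;=\; 10^{-3},
\qquad\text{compared with an initial box } [-0.01,0.01]^{2} \text{ of side } 2\cdot 10^{-2}.
```

Under these assumptions the guaranteed cluster diameter is about twenty times smaller than the side of the initialization box, and it shrinks further as $n$ grows or $h$ decreases.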

The proof of the main result is actually rather versatile and should easily adapt to a variety of other settings that might be of interest. This versatility is partly due to the connection of the argument to rather fundamental ideas in partial differential equations; indeed, the argument may be interpreted as a maximum principle for a discrete parabolic operator acting on vector-valued (i.e. points in space) data. This interpretation is what led us to establish a connection to spectral clustering which we now discuss.

4. A Connection to Spectral Clustering

4.1. Approximating spectral clustering

The purpose of this section is to note that it is possible to take the limit $\alpha\rightarrow\infty,~h\rightarrow 0$ (scaled so that $\alpha\cdot h=\mbox{const}$ ) and that, in that limit, one obtains a simple spectral clustering method. We re-introduce notation and assume again that $\mathcal{X}=\left\{x_{1},x_{2},\dots,x_{n}\right\}\subset\mathbb{R}^{d}$ is given. We assume $p_{ij}$ is some collection of affinities scaled in such a way that for $x_{i},x_{j}$ in the same cluster $\pi(i)=\pi(j)$

$$p_{ij}\geq\frac{1}{10}\frac{1}{|\pi^{-1}(\pi(i))|}\quad\mbox{and}\quad\sum_{j=1}^{n}p_{ij}\leq 1.$$

We observe that this scaling is slightly different from the one above: it is obtained by absorbing the $\alpha h\sim n$ term into the affinities. At the same time, $h\rightarrow 0$ implies that the repulsion term containing the $q_{ij}$ does not exert any force. This implies that, in the limit, the remaining term in the gradient descent method is given by

$$y_{i}(t+1)=\sum_{j\neq i}p_{ij}y_{j}(t)+\Bigg(1-\sum_{j\neq i}p_{ij}\Bigg)y_{i}(t).$$

This, however, can be interpreted as a Markov chain with suitably chosen transition probabilities. It may be unusual, at first, to see this equation since the $y_{i}(t)$ are vectors in $\mathbb{R}^{2}$ ; however, the equations decouple across coordinates, which allows for a reduction to the familiar form. All the canonical results from spectral clustering apply: the asymptotic behavior is given by the largest non-trivial eigenvalue(s), which are either 1 (in the case of perfectly separated clusters) or very close to 1, and the convergence speed depends on the spectral gap.
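The passage to the limit can be sketched in two lines. The sketch starts from the exaggerated gradient step of §3.1 and assumes that all points lie in a small ball, so that $q_{ij}Z=(1+\|y_{i}-y_{j}\|^{2})^{-1}\approx 1$ ; this approximation is an assumption of the sketch and not a statement taken from the proof.

```latex
\begin{align*}
y_i(t+1) &= y_i(t) - \alpha h \sum_{j \neq i} p_{ij}\, q_{ij} Z\, \bigl(y_i(t) - y_j(t)\bigr)
            + h \sum_{j \neq i} q_{ij}^{2} Z\, \bigl(y_i(t) - y_j(t)\bigr) \\
         &\approx y_i(t) + \sum_{j \neq i} p_{ij}\, \bigl(y_j(t) - y_i(t)\bigr)
            \qquad (\alpha h \text{ absorbed into } p_{ij},\ h \to 0) \\
         &= \sum_{j \neq i} p_{ij}\, y_j(t) + \Bigl(1 - \sum_{j \neq i} p_{ij}\Bigr) y_i(t).
\end{align*}
```

Since the rescaled affinities satisfy $\sum_{j}p_{ij}\leq 1$ , the last line expresses $y_{i}(t+1)$ as a convex combination of the current points, which is what permits the Markov chain interpretation.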

4.2. Visualizing spectral clustering

The connection also allows us to go the other direction and discuss a particular visualization technique for spectral methods that shows arising clusters as points in $\mathbb{R}^{2}$ (or higher dimensions, which is not essential here). The transition matrix of the Markov chain is given by

$$A_{ij}=\begin{cases}1-\sum_{k\neq i}p_{ik}&\mbox{if }i=j\\ p_{ij}&\mbox{otherwise.}\end{cases}$$

The large-time behavior of $y(t)=A^{t}y(0)$ is essentially determined by the spectrum of $A$ close to 1. Moreover, in the case of perfect clustering with $p_{ij}=0$ whenever $x_{i}$ and $x_{j}$ are in different clusters, there are exactly $k$ eigenvalues equal to 1 and the initialization converges to that.

Let us now assume that the goal is visualization in $\mathbb{R}^{2}$ . We let $\mathcal{Y}=\left\{y_{1}(0),y_{2}(0),\dots,y_{n}(0)\right\}\subset\mathbb{R}^{2}$ be a set of points that we assume are i.i.d. random variables from, say, the uniform distribution on $[-0.01,0.01]^{2}$ . We propose to visualize the point set after $k$ iterations as follows: collect these $n$ initial vectors in an $n\times 2$ matrix $\underline{y}$ and interpret the $n$ rows of $A^{k}\underline{y}$ as coordinates in $\mathbb{R}^{2}$ . This creates t-SNE-style visualizations for spectral methods (see Fig. 6 and Fig. 7, lower rows).
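The visualization procedure can be sketched directly from this description. The code assumes affinities already rescaled as in §4.1 (row sums at most 1) and applies the matrix repeatedly instead of using an eigendecomposition; the function names and the toy affinities are illustrative assumptions.

```python
import random

def transition_matrix(P):
    """A_ii = 1 - sum_{k != i} p_ik and A_ij = p_ij otherwise, so rows sum to 1."""
    n = len(P)
    return [[1.0 - sum(P[i][k] for k in range(n) if k != i) if i == j else P[i][j]
             for j in range(n)] for i in range(n)]

def iterate(A, Y, steps):
    """Apply y <- A y `steps` times; each coordinate evolves independently."""
    n, s = len(Y), len(Y[0])
    for _ in range(steps):
        Y = [[sum(A[i][j] * Y[j][d] for j in range(n)) for d in range(s)]
             for i in range(n)]
    return Y

def random_init(n, scale=0.01, seed=0):
    """i.i.d. uniform points in [-scale, scale]^2."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(2)] for _ in range(n)]

# Toy data: two perfectly separated clusters of three points, p_ij = 1/4 inside.
P = [[0.25 if i != j and i // 3 == j // 3 else 0.0 for j in range(6)]
     for i in range(6)]
A = transition_matrix(P)
Y0 = random_init(6)
Y = iterate(A, Y0, 30)
```

In this perfectly clustered toy case each cluster collapses onto the mean of its initial points, so the rows of $A^{k}\underline{y}$ show one dot per cluster.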

Examples. An example of this method is shown in Figure 6. The example comprises 40000 points in $\mathbb{R}^{25}$ sampled from four very narrow Gaussians, so the data are highly clustered. We used a perplexity of 30 to create the $p_{ij}$ and used $\alpha=n/10,h=1$ in the implementation of t-SNE. The second row in Figure 6 shows the projection onto the 50 largest eigenvectors of $A$ . t-SNE took roughly 7 minutes vs. 1 minute for the spectral decomposition; note, however, that once the spectral decomposition has been computed, iterations can be computed in constant time (one only has to raise the eigenvalues to some power).

Another example, run on 4 digits in MNIST, is given in Figure 7; again, both methods coincide. This shows that our derivation of the approximating spectral method was accurate. At the same time, it suggests repeating the fundamental question.

Open problem. Is the clustering behavior of the early exaggeration phase of t-SNE with $\alpha=12,h=200$ (and possibly optimization techniques such as momentum) essentially qualitatively equivalent to the behavior of t-SNE with our parameter choice $\alpha=n/10,h=1$ ?

If this were indeed the case, then the early exaggeration phase of t-SNE would be simply a spectral clustering method in disguise; if not, then it would be very valuable to understand under which circumstances its performance is superior to spectral clustering and whether its underlying mechanisms could be used to boost spectral methods. We reiterate that we believe this to be a very interesting problem.

5. Ingredients for the Proof: Discretized Dynamical Systems

This section introduces a type of discrete dynamical system on sets of points in $\mathbb{R}^{s}$ and describes its asymptotic behavior. The result is self-contained: it could potentially be interpreted as an analysis of a spectral method that is robust to small error terms, but the analysis is simple enough for us to present it in full. Our original guiding picture was that of the maximum principle in the theory of parabolic partial differential equations.

5.1. A discrete dynamical system

Let $z_{1},\dots,z_{n}\in\mathbb{R}^{s}$ be given. We use them as initial values for a time-discrete dynamical system that is defined via

[TABLE]

At this stage, if the points are in general position and $n\geq s$ , basic linear algebra implies that this system can undergo almost any arbitrary evolution as long as one is free to choose $\alpha_{i,j,t}$ . We will henceforth assume that these parameters satisfy the following three conditions.

(1) There is a uniform lower bound on the coefficients for all $t>0$ and all $i\neq j$

[TABLE]

(2) There is a uniform upper bound on the coefficients

[TABLE]

(3) There is a uniform upper bound on the error term

[TABLE]

A typical example of such a dynamical system is given in the Figure below: we start with twelve points on the unit circle and then iterate the system for some random choices of $a_{i,j,t}$ and random $\varepsilon_{i}(t)$ . The points move at first towards each other until they are close and the error term starts being on the same scale as the forces of attraction. The points then move around randomly (all the while staying close to each other). We will make this intuitive picture precise below.
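The experiment described above can be reproduced in spirit with a short simulation. The sketch below assumes the update has the form $z_{i}(t+1)=z_{i}(t)+\sum_{j\neq i}a_{i,j,t}(z_{j}(t)-z_{i}(t))+\varepsilon_{i}(t)$ with $\delta\leq a_{i,j,t}$, $\sum_{j\neq i}a_{i,j,t}\leq 1$ and $\|\varepsilon_{i}(t)\|\leq\varepsilon$; the specific values of `delta` and `eps` and the way the random coefficients are drawn are illustrative choices, not taken from the paper.

```python
import numpy as np

def step(z, delta, eps, rng):
    """One iteration with random coefficients and random bounded errors."""
    n, s = z.shape
    # Coefficients in [delta, 1/n): each off-diagonal row sum is below 1.
    a = rng.uniform(delta, 1.0 / n, size=(n, n))
    np.fill_diagonal(a, 0.0)
    # Error vectors with Euclidean norm at most eps.
    e = rng.standard_normal((n, s))
    e *= eps * rng.uniform(0.0, 1.0, size=(n, 1)) / np.linalg.norm(e, axis=1, keepdims=True)
    # z_i + sum_j a_ij (z_j - z_i) + eps_i
    return z + a @ z - a.sum(axis=1, keepdims=True) * z + e

def diameter(z):
    return np.max(np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1))

rng = np.random.default_rng(1)
angles = 2.0 * np.pi * np.arange(12) / 12
z = np.column_stack([np.cos(angles), np.sin(angles)])  # twelve points on the unit circle
delta, eps = 0.02, 1e-3

diams = [diameter(z)]
for _ in range(200):
    z = step(z, delta, eps, rng)
    diams.append(diameter(z))
```

Plotting `diams` on a logarithmic scale shows a fast initial decay followed by a plateau at a small scale where the error term and the attraction balance.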

The main result of this section is that all the points in this dynamical system are eventually contained in a ball whose size only depends on $n,\delta$ and $\varepsilon$ . We start by showing that the convex hull of the points is stable. We use $B(0,\varepsilon)$ to denote a ball of radius $\varepsilon$ , $A+B=\left\{a+b:a\in A\wedge b\in B\right\}$ and $\operatorname{conv}{A}$ for the convex hull of $A$ .

Lemma 1 (Stability of the convex hull).

With the assumptions above, we have

[TABLE]

Proof.

This argument is simple. We note that

[TABLE]

By assumption,

[TABLE]

and this implies $z_{i}(t+1)-\varepsilon_{i}(t)\in\operatorname{conv}\left\{z_{1}(t),z_{2}(t),\dots,z_{n}(t)\right\}$ . ∎
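The convex-combination step behind this proof can be written out explicitly. The display below is a sketch that assumes the update rule has the form $z_{i}(t+1)=z_{i}(t)+\sum_{j\neq i}a_{i,j,t}(z_{j}(t)-z_{i}(t))+\varepsilon_{i}(t)$ and that conditions (1) and (2) read $a_{i,j,t}\geq\delta>0$ and $\sum_{j\neq i}a_{i,j,t}\leq 1$; these forms are assumptions consistent with the surrounding text, not a quotation of it.

```latex
\begin{align*}
z_i(t+1) - \varepsilon_i(t)
  &= z_i(t) + \sum_{j \neq i} a_{i,j,t}\,\bigl(z_j(t) - z_i(t)\bigr) \\
  &= \Bigl(1 - \sum_{j \neq i} a_{i,j,t}\Bigr) z_i(t) + \sum_{j \neq i} a_{i,j,t}\, z_j(t).
\end{align*}
```

Under these assumptions all weights on the right-hand side are nonnegative and sum to 1, so the expression is a convex combination of $z_{1}(t),\dots,z_{n}(t)$. Adding back $\varepsilon_{i}(t)$ with $\|\varepsilon_{i}(t)\|\leq\varepsilon$ moves the point by at most $\varepsilon$, which places $z_{i}(t+1)$ in the convex hull enlarged by $B(0,\varepsilon)$.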

Lemma 2 (Contraction inequality).

With the notation above, if the diameter is large

[TABLE]

then

[TABLE]

One particularly important consequence is the following: the diameter shrinks, at an exponential rate $\left(1-n\delta/20\right)^{t}$ , to a size of $\sim\varepsilon/(n\delta)$ . Of course, this convergence is particularly fast whenever $n\delta\sim 1$ . It is easy to see, for example by taking $n=2$ points in $\mathbb{R}$ , that this is the optimal scale for the result to hold.
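The two-point example mentioned above can be checked numerically. The sketch assumes the worst case for $n=2$ points on the real line: the coefficient always equals the lower bound $\delta$ and the errors always push the points apart with full size $\varepsilon$, so the gap $d$ obeys $d(t+1)=(1-2\delta)d(t)+2\varepsilon$. The numerical values are illustrative.

```python
def two_point_gap(delta, eps, d0, steps):
    """Gap between two points on the line under worst-case errors."""
    d = d0
    for _ in range(steps):
        # Attraction shrinks the gap by 2*delta*d, errors widen it by 2*eps.
        d = (1.0 - 2.0 * delta) * d + 2.0 * eps
    return d

delta, eps = 0.1, 1e-3
gap = two_point_gap(delta, eps, d0=1.0, steps=500)
limit = eps / delta  # fixed point of the recursion, equal to 2 * eps / (n * delta) for n = 2
```

The gap converges to `eps / delta` and no further, so in this scenario a bound of order $\varepsilon/(n\delta)$ cannot be improved beyond constants.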

Proof of Lemma 2.

The method of proof will be as follows: we will project the set of points onto an arbitrary line (say, the $x$-axis, by taking only the first coordinate of each point) and show that the one-dimensional projections contract exponentially quickly. This then implies the desired statement. Let $\pi_{x}:\mathbb{R}^{s}\rightarrow\mathbb{R}$ be such a projection. We abbreviate the diameter of the projection as

[TABLE]

We may assume w.l.o.g. that $\left\{\pi_{x}z_{1}(t),\pi_{x}z_{2}(t),\dots,\pi_{x}z_{n}(t)\right\}\subset[0,\operatorname{diam}]$ . We then subdivide the interval into two regions

[TABLE]

and denote the number of points in each interval by $i_{1},i_{2}$ . Clearly, $i_{1}+i_{2}=n$ and therefore either $i_{1}\geq n/2$ or $i_{2}\geq n/2$ . We assume w.l.o.g. the first case holds. Projections are linear, thus

[TABLE]

We abbreviate

[TABLE]

and write

[TABLE]

Moreover, using the lower bound $a_{i,j,t}\geq\delta$

[TABLE]

Then, however,

[TABLE]

which shows that $\pi_{x}z_{i}(t+1)\in[0,\operatorname{diam}(1-n\delta/4)].$ Accounting for the error term, we get

[TABLE]

If the diameter is indeed disproportionately large

[TABLE]

then this can be rearranged as

[TABLE]

and therefore

[TABLE]

Since this is true in every projection, it also holds for the diameter of the original set. ∎

Remark. The argument could be slightly improved because in its current form it assumes that the error $\varepsilon_{i}$ has $\|\varepsilon_{i}(t)\|_{\ell^{\infty}}=\varepsilon$ , while we assume $\|\varepsilon_{i}(t)\|_{\ell^{\infty}}\leq\|\varepsilon_{i}(t)\|_{\ell^{2}}=\varepsilon$ . This, together with the usual other optimization schemes, should yield an improved estimate on the constant. The condition on $\delta$ could also be weakened (at the cost of losing constants). In particular, it would be sufficient in Assumption (1) in our main result to assume that, for every $1\leq i\leq n$

[TABLE]

that are in the same cluster $\pi(z_{i})=\pi(z_{j})$ . By adapting the proof, the constant $(1/2+\varepsilon)$ could be reduced further; however, this inevitably decreases the provable bounds on the exponential decay rate (which is not an artifact of the method: convergence will indeed slow down).

6. Proof of the Main Result

The rough outline of the argument is as follows: we initialize all points inside $[-0.01,0.01]^{2}$ . We rewrite the gradient descent method acting on one particular embedded cluster as a dynamical system of the type studied above with an error term. The error term contains $q_{ij}$ , which depend on distances between points from different clusters. This is difficult to control, especially if the points are far apart. Our strategy will now be as follows: we show that the $q_{ij}$ are all under control as long as everything is contained in $[-0.02,0.02]^{2}$ . We use stability of the convex hull to guarantee that all of the embedded points are within $[-0.02,0.02]^{2}$ for at least $\ell$ iterations and show that this time-scale is enough to guarantee contraction of the cluster.
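To make the outline concrete, here is a minimal sketch of gradient descent with early exaggeration on a toy problem. It uses the gradient $4\sum_{j\neq i}(p_{ij}-q_{ij})q_{ij}Z(y_{i}-y_{j})$ with $p_{ij}$ replaced by $\alpha p_{ij}$ and splits the step into an attractive and a repulsive part. The block-structured `p`, the cluster sizes, the number of steps and the decision to keep the factor 4 explicit (rather than absorb it into $h$) are assumptions made for illustration.

```python
import numpy as np

def exaggerated_step(y, p, alpha, h):
    """One gradient step on the KL cost with p_ij replaced by alpha * p_ij."""
    diff = y[:, None, :] - y[None, :, :]            # y_i - y_j
    w = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # equals q_ij * Z
    np.fill_diagonal(w, 0.0)
    q = w / w.sum()
    attract = -4.0 * h * np.einsum("ij,ijk->ik", alpha * p * w, diff)
    repulse = 4.0 * h * np.einsum("ij,ijk->ik", q * w, diff)
    return y + attract + repulse

def diameter(pts):
    return np.max(np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1))

rng = np.random.default_rng(0)
n, m = 40, 20
labels = np.repeat([0, 1], m)
# Perfectly clustered toy affinities: uniform inside a cluster, zero across clusters.
p = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(p, 0.0)
p /= p.sum()

y = rng.uniform(-0.01, 0.01, size=(n, 2))
before = [diameter(y[labels == c]) for c in (0, 1)]
for _ in range(20):
    y = exaggerated_step(y, p, alpha=n / 10, h=1.0)
after = [diameter(y[labels == c]) for c in (0, 1)]
```

In this toy setting each embedded cluster contracts by several orders of magnitude within a few iterations, while the repulsive part stays comparatively small as long as all points remain close to the origin.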

Proof.

We start by showing that the $q_{ij}$ are comparable as long as the point set is contained in a small region of space. Let now $\left\{y_{1},y_{2},\dots,y_{n}\right\}\subset[-0.02,0.02]^{2}$ and recall the definitions

[TABLE]

Then, however, it is easy to see that $0\leq\|y_{i}-y_{j}\|\leq 0.06$ implies

[TABLE]
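The comparability of the $q_{ij}$ is easy to check numerically. The sketch below assumes the standard t-SNE definition $q_{ij}=(1+\|y_{i}-y_{j}\|^{2})^{-1}/Z$, so that $q_{ij}Z=(1+\|y_{i}-y_{j}\|^{2})^{-1}$; the sample size and random seed are arbitrary, and the bounds tested are the elementary ones that follow from $\|y_{i}-y_{j}\|\leq 0.06$, not necessarily the exact constants used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.uniform(-0.02, 0.02, size=(n, 2))

sq = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
off = ~np.eye(n, dtype=bool)          # mask selecting pairs with i != j
qz = 1.0 / (1.0 + sq[off])            # q_ij * Z for all i != j
z_norm = qz.sum()                     # the normalization Z

lower = 1.0 / (1.0 + 0.06 ** 2)       # about 0.9964
max_dist = np.sqrt(sq.max())
```

Every value of `qz` lies between `lower` and 1, so `z_norm` is within a fraction of a percent of `n * (n - 1)` and all $q_{ij}$ are nearly equal.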

We will now restrict ourselves to a small embedded cluster $\left\{y_{i}:\pi(i)~{}\mbox{fixed}\right\}$ and rewrite the gradient descent method as

[TABLE]

where the first sum yields the main contribution and the other two sums are treated as a small error. Applying our results for dynamical systems of this type requires us to verify the conditions. We start by showing that the conditions on the coefficients are satisfied. Clearly,

[TABLE]

which is admissible whenever $\alpha h\sim n$ . As for the upper bound, it is easy to see that

[TABLE]

It remains to study the size of the error term for which we use the triangle inequality

[TABLE]

and, similarly for the second term,

[TABLE]

This tells us that the norm of the error term is bounded by

[TABLE]

It remains to check whether the time-scales fit. By stability of the convex hull, each iteration can enlarge the convex hull of the points by at most $\varepsilon$ , so starting from $[-0.01,0.01]^{2}$ the assumption $\mathcal{Y}\subset[-0.02,0.02]^{2}$ holds for at least $\ell\geq 0.01/\varepsilon$ iterations. At the same time, the contraction inequality implies that in that time the cluster shrinks to size

[TABLE]

where the last inequality follows from the elementary inequality

[TABLE]

∎

Remarks. The proof is relatively flexible in several different spots. By demanding that the initialization $\mathcal{Y}$ is contained in a sufficiently small ball, one can force the quantity $Zq_{ij}$ to be arbitrarily close to 1. We also emphasize that we did not optimize over constants and additional fine-tuning in various spots would yield better constants (at the cost of a more involved argument which is why we decided against it). The use of the triangle inequality in bounding the error terms is another part of the proof that deserves attention: if the clusters are spread out, then we would expect the repulsive forces to act from all directions and lead to additional cancellation (which, if it were indeed the case that the $q_{ij}$ do not play a significant role in the clustering that occurs in the early exaggeration phase, would be an additional reason for the strong similarity to the outcome of the spectral method). It could be of interest to study mean-field-type approximations to gain a better understanding of this phenomenon.

Bibliography

1. Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245, 2014.
2. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
3. Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
4. Uri Shaham and Stefan Steinerberger. Stochastic Neighbor Embedding separates well-separated clusters. Preprint, arXiv:1702.02670, 2017.
