An Analysis of the t-SNE Algorithm for Data Visualization
Sanjeev Arora, Wei Hu, Pravesh K. Kothari

TL;DR
This paper gives the first formal analysis of t-SNE, with provable guarantees that it produces 2D visualizations separating the clusters of clusterable data when the ground-truth clusters satisfy a deterministic condition.
Contribution
It introduces a formal framework for data visualization and rigorously analyzes t-SNE's performance, both under a deterministic condition on the ground-truth clusters and under probabilistic models that satisfy it.
Findings
t-SNE successfully separates clusters under certain deterministic conditions.
Probabilistic models like mixtures of log-concave distributions satisfy these conditions.
t-SNE can partially recover cluster structure even without the deterministic assumptions.
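The findings above turn on whether a 2D embedding "separates" clusters. One simple way to make that checkable is sketched below. The criterion used here (every within-cluster distance is at most half of every between-cluster distance) and the function name `clusters_visibly_separated` are assumptions chosen for illustration, not the paper's formal definition.

```python
import math
from itertools import combinations


def clusters_visibly_separated(points, labels):
    """Illustrative check (an assumption, not the paper's definition):
    True if the largest within-cluster distance is at most half of the
    smallest between-cluster distance in the 2D embedding."""
    max_intra = 0.0
    min_inter = math.inf
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if a == b:
            max_intra = max(max_intra, d)
        else:
            min_inter = min(min_inter, d)
    return max_intra <= 0.5 * min_inter


# Hypothetical embeddings: two tight, far-apart groups vs. interleaved points.
labels = [0, 0, 0, 1, 1, 1]
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
mixed = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.5, 0.0), (1.5, 0.0), (2.5, 0.0)]

print(clusters_visibly_separated(tight, labels))
print(clusters_visibly_separated(mixed, labels))
```

A check like this could be applied to the output of any t-SNE implementation when ground-truth labels are available. The factor 0.5 is an arbitrary illustrative margin.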
Abstract
A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization - finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the "ground-truth" clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data.…
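For reference, the non-convex optimization behind t-SNE is usually stated as follows. This is the standard formulation from van der Maaten and Hinton rather than text from the truncated abstract. The symbols are assumptions of this sketch: $x_i$ are the $n$ input points, $y_i$ their 2D images, and $\sigma_i$ the per-point bandwidths set by the perplexity parameter.

```latex
\begin{align*}
p_{j|i} &= \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{l \neq i} \exp\!\left(-\|x_i - x_l\|^2 / 2\sigma_i^2\right)},
&
p_{ij} &= \frac{p_{i|j} + p_{j|i}}{2n}, \\
q_{ij} &= \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}},
&
\min_{y_1, \dots, y_n \in \mathbb{R}^2} \; \mathrm{KL}(P \,\|\, Q)
       &= \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{align*}
```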
Taxonomy
Topics: Topological and Geometric Data Analysis · Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models
