Beyond Explained Variance: A Cautionary Tale of PCA
Gionni Marchetti

TL;DR
This paper critiques PCA as a tool for visualizing data that lie on a nonlinear low-dimensional manifold. On a fossil teeth dataset from Kuehneotherium, t-SNE and persistent homology reveal a ring-like structure, challenging the clustering suggested by a previously published PCA scatterplot.
Contribution
The paper combines t-SNE and persistent homology to analyze the structure of nonlinear data, and proposes a generative probabilistic-geometric model, in which the data are sampled uniformly from a unit circle, that supports these findings.
Findings
The PCA scatterplot shows clustering in the region where PC2 < 0, but t-SNE and persistent homology (PH) reveal a ring-like structure with no evident clustering.
The data likely lie on a one-dimensional manifold, a circle.
Pairwise cosine distances follow an arcsine distribution, supporting the geometric model.
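The arcsine claim can be checked with a small simulation. The sketch below is not the paper's code: it assumes the cosine distance is defined as d = 1 - cos(theta), where theta is the angle between two points, and the helper names (`sample_circle`, `pairwise_cosine_distances`, `arcsine_cdf`, `empirical_cdf`) are hypothetical. Under that assumption, points drawn uniformly from a unit circle give distances on [0, 2] with CDF F(d) = arccos(1 - d) / pi, which is an arcsine distribution rescaled to [0, 2] with a U-shaped density.

```python
import math
import random


def sample_circle(n, rng):
    """Draw n points uniformly from the unit circle (assumed generative model)."""
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(math.cos(a), math.sin(a)) for a in angles]


def pairwise_cosine_distances(points):
    """Cosine distance 1 - cos(theta) for every unordered pair of points."""
    dists = []
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            # Points are unit vectors, so the dot product is the cosine similarity.
            d = 1.0 - (xi * xj + yi * yj)
            # Clamp to [0, 2] to absorb floating-point round-off.
            dists.append(min(2.0, max(0.0, d)))
    return dists


def arcsine_cdf(d):
    """CDF of 1 - cos(theta) for theta uniform on [0, 2*pi): arccos(1 - d) / pi."""
    return math.acos(1.0 - d) / math.pi


def empirical_cdf(values, d):
    """Fraction of values less than or equal to d."""
    return sum(v <= d for v in values) / len(values)


rng = random.Random(0)  # fixed seed for reproducibility
points = sample_circle(300, rng)
dists = pairwise_cosine_distances(points)

# Compare empirical and theoretical CDFs at a few distances.
for d in (0.5, 1.0, 1.5):
    print(f"d={d:.1f}  empirical={empirical_cdf(dists, d):.3f}  "
          f"arcsine={arcsine_cdf(d):.3f}")

# U-shape check: more mass near the ends of [0, 2] than around the midpoint.
low_edge = sum(v <= 0.2 for v in dists) / len(dists)
middle = sum(0.9 <= v <= 1.1 for v in dists) / len(dists)
print(f"mass in [0, 0.2]: {low_edge:.3f}   mass in [0.9, 1.1]: {middle:.3f}")
```

Both intervals in the final check have width 0.2, so the larger mass near zero reflects the U shape directly rather than a difference in bin size.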
Abstract
We address shortcomings of principal component analysis (PCA) for visualizing high-dimensional data lying on a nonlinear low-dimensional manifold via two-dimensional scatterplots, focusing on a fossil teeth dataset from the early mammalian insectivore Kuehneotherium. While the PCA scatterplot reported by Jolliffe and Cadima (Philosophical Transactions of the Royal Society A, 2016) shows clustering in the region where PC2 < 0, our analysis based on t-SNE and persistent homology (PH) reveals a ring-like structure with no evident clustering and intrinsic dimensionality equal to one. We further propose a generative probabilistic-geometric model in which the data are sampled uniformly from a unit circle. Under this model, pairwise cosine distances follow an arcsine distribution, in qualitative agreement with the observed U-shaped distribution, thereby independently supporting the analysis…
