Topological Data Analysis on Simple English Wikipedia Articles

Matthew Wright; Xiaojun Zheng

arXiv:2007.00063·math.AT·December 14, 2020

TL;DR

This paper introduces three statistical methods for comparing geometric data using two-parameter persistent homology, based on the Hilbert function, the matching distance, and barcodes, and applies them to high-dimensional point-cloud data obtained from Simple English Wikipedia articles.

Contribution

The paper develops statistical techniques for two-parameter persistent homology, an area where such techniques have scarcely been considered, and demonstrates their application to real-world Wikipedia data.

Findings

01. Methods can distinguish data subsets effectively

02. Approaches help compare data with random models

03. Insights into null distributions and noise stability

Abstract

Single-parameter persistent homology, a key tool in topological data analysis, has been widely applied to data problems along with statistical techniques that quantify the significance of the results. In contrast, statistical techniques for two-parameter persistence, while highly desirable for real-world applications, have scarcely been considered. We present three statistical approaches for comparing geometric data using two-parameter persistent homology; these approaches rely on the Hilbert function, matching distance, and barcodes obtained from two-parameter persistence modules computed from the point-cloud data. Our statistical methods are broadly applicable for analysis of geometric data indexed by a real-valued parameter. We apply these approaches to analyze high-dimensional point-cloud data obtained from Simple English Wikipedia articles. In particular, we show how our methods…
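As a rough illustration of one ingredient named in the abstract: the Hilbert function of a two-parameter persistence module records, at each pair of parameter values, the dimension of the homology vector space at that index. The sketch below is not the authors' code. It assumes a simple bifiltration in which a point enters once its real-valued parameter is at most a threshold a, and two points are joined once their distance is at most a scale b; in degree 0 the Hilbert function is then the number of connected components. The function names, the grids, and the toy data are all hypothetical.

```python
from math import dist


def count_components(points, scale):
    """Count connected components of the graph joining points at distance <= scale."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = len(points)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= scale:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    components -= 1
    return components


def hilbert_function_h0(points, values, value_grid, scale_grid):
    """Return table[i][j] = dim H_0 at index (value_grid[i], scale_grid[j]).

    Assumed bifiltration: keep points whose value is <= a, then connect
    kept points whose pairwise distance is <= b.
    """
    table = []
    for a in value_grid:
        kept = [p for p, v in zip(points, values) if v <= a]
        table.append([count_components(kept, b) for b in scale_grid])
    return table


# Toy example: three points on a line, each with a real-valued parameter.
toy_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
toy_values = [0.1, 0.5, 0.9]
print(hilbert_function_h0(toy_points, toy_values, [0.2, 0.6, 1.0], [0.5, 1.5, 4.5]))
# prints [[1, 1, 1], [2, 1, 1], [3, 2, 1]]
```

Each row of the table fixes a threshold on the real-valued parameter and each column fixes a distance scale, so the table is a discretized Hilbert function in homology degree 0. Higher homology degrees, the matching distance, and barcodes require dedicated persistence software and are outside the scope of this sketch.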

Taxonomy

Topics: Topological and Geometric Data Analysis