A Theory of Taxonomy
Guido D'Amico, Raul Rabadan, Matthew Kleban

TL;DR
This paper introduces a universal branching model that explains the distribution of items across categories in large taxonomies, with applications spanning ecology, computer science, and library science.
Contribution
It proposes a simple, non-parametric model that reproduces observed abundance distributions in diverse real-world datasets, revealing underlying commonalities.
Findings
The model accurately fits data on items lost and found on the New York City transit system, library books, and a bacterial microbiome.
It predicts unrepresented categories in finite samples.
A universal pattern in taxonomic abundance distributions is identified.
Abstract
A taxonomy is a standardized framework to classify and organize items into categories. Hierarchical taxonomies are ubiquitous, ranging from the classification of organisms to the file system on a computer. Characterizing the typical distribution of items within taxonomic categories is an important question with applications in many disciplines. Ecologists have long sought to account for the patterns observed in species-abundance distributions (the number of individuals per species found in some sample), and computer scientists study the distribution of files per directory. Is there a universal statistical distribution describing how many items are typically found in each category in large taxonomies? Here, we analyze a wide array of large, real-world datasets -- including items lost and found on the New York City transit system, library books, and a bacterial microbiome -- and discover…
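As an illustration of the quantity the abstract refers to, the following minimal Python sketch tabulates an abundance distribution from a flat list of category labels: first the number of items per category, then how many categories hold exactly n items. The function name `abundance_distribution` and the sample labels are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets, and the sketch does not implement the paper's branching model.

```python
from collections import Counter


def abundance_distribution(labels):
    """Return {n: number of categories that contain exactly n items}."""
    # Step 1: count items per category (category -> item count).
    items_per_category = Counter(labels)
    # Step 2: count how many categories share each item count.
    return dict(Counter(items_per_category.values()))


# Hypothetical sample, invented for illustration only.
sample = ["umbrella", "wallet", "umbrella", "phone", "wallet", "umbrella", "keys"]
dist = abundance_distribution(sample)
print(dist)
```

In this toy sample, two categories contain one item each, one contains two, and one contains three. Real datasets of the kind the abstract mentions would yield a much longer-tailed table.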
Taxonomy
Topics: Complex Network Analysis Techniques · Plant and animal studies · Genomics and Phylogenetic Studies
