Towards Evaluation of Cultural-scale Claims in Light of Topic Model Sampling Effects
Jaimie Murdock, Jiaan Zeng, Colin Allen

TL;DR
This study investigates how sampling affects the stability of topic models in large text collections, revealing that larger samples produce more consistent and overlapping topics, which can inform cultural and socio-linguistic analyses.
Contribution
It introduces a method to evaluate the sensitivity of topic models to sampling effects across different library classifications, highlighting how sample size influences model stability and interpretability.
Findings
Larger samples yield more stable topic models.
Alignment distance decreases with increased sample size.
Topic overlap increases as sample size grows.
Abstract
Cultural-scale models of full text documents are prone to over-interpretation by researchers making unintentionally strong socio-linguistic claims (Pechenick et al., 2015) without recognizing that even large digital libraries are merely samples of all the books ever produced. In this study, we test the sensitivity of the topic models to the sampling process by taking random samples of books in the Hathi Trust Digital Library from different areas of the Library of Congress Classification Outline. For each classification area, we train several topic models over the entire class with different random seeds, generating a set of spanning models. Then, we train topic models on random samples of books from the classification area, generating a set of sample models. Finally, we perform a topic alignment between each pair of models by computing the Jensen-Shannon distance (JSD) between the word…
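The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract is cut off here, so this is not the authors' implementation: it assumes each topic is a probability distribution over a shared vocabulary, and that each topic in one model is matched to its nearest topic in the other by Jensen-Shannon distance. The names `js_distance` and `align_topics` and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence.

    With base-2 logarithms the result lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def align_topics(model_a, model_b):
    """Match each topic in model_a to its nearest topic in model_b.

    Assumption: nearest-neighbour matching; the paper may use a different rule.
    Returns a list of (index_in_model_b, distance) pairs, one per topic in model_a.
    """
    alignment = []
    for topic_a in model_a:
        distances = [js_distance(topic_a, topic_b) for topic_b in model_b]
        best = min(range(len(distances)), key=distances.__getitem__)
        alignment.append((best, distances[best]))
    return alignment


# Toy topic-word distributions over a 4-word vocabulary (made-up numbers).
spanning_model = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
sample_model = [[0.1, 0.1, 0.2, 0.6], [0.6, 0.3, 0.1, 0.0]]

alignment = align_topics(spanning_model, sample_model)
for topic_index, (match, distance) in enumerate(alignment):
    print(f"spanning topic {topic_index} -> sample topic {match} (JSD {distance:.3f})")
```

Under these assumptions, a smaller average matched distance between a sample model and a spanning model would correspond to the reported decrease in alignment distance as sample size grows.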
Taxonomy
Topics: Computational and Text Analysis Methods · Diverse Approaches in Healthcare and Education Studies
