Topic Modelling of Empirical Text Corpora: Validity, Reliability, and Reproducibility in Comparison to Semantic Maps
Tobias Hecking, Loet Leydesdorff

TL;DR
This study compares LDA topic models with PCA-based factor models on the 6,638 societal impact case descriptions submitted to REF 2014, highlighting differences in stability and semantic coherence and their implications for research evaluation.
Contribution
It demonstrates the differing robustness and semantic quality of LDA versus PCA models on empirical text data, emphasizing the importance of validation.
Findings
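LDA is far more sensitive to document removal than PCA: the largest distortion of a PCA-based model has less effect than the smallest distortion of an LDA-based model.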
LDA outperforms PCA in semantic coherence.
Topic models describe statistical properties of the document set; their results should not be used for semantic interpretation.
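The document-removal finding can be illustrated with a minimal sketch. The summary does not say how the authors quantified distortion, so everything here is an assumption: the helper names `drop_documents` and `model_distortion` are hypothetical, and the measure (one minus the mean best-match absolute cosine similarity between components) is one common choice, not necessarily the one used in the paper. Each model is represented as a list of component vectors (topic-word weights for LDA, component loadings for PCA) over a shared vocabulary.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def drop_documents(docs, fraction, seed=0):
    """Return a copy of docs with a random fraction removed (hypothetical helper)."""
    rng = random.Random(seed)
    n_drop = int(round(fraction * len(docs)))
    dropped = set(rng.sample(range(len(docs)), n_drop))
    return [doc for i, doc in enumerate(docs) if i not in dropped]


def model_distortion(reference, perturbed):
    """Assumed distortion measure: 1 - mean best-match |cosine| between components.

    Components are paired greedily, most similar pair first. The absolute
    value is used because the sign of a PCA component is arbitrary.
    """
    if not reference or not perturbed:
        raise ValueError("both models need at least one component")
    pairs = sorted(
        ((abs(cosine(r, p)), i, j)
         for i, r in enumerate(reference)
         for j, p in enumerate(perturbed)),
        reverse=True,
    )
    used_ref, used_pert, sims = set(), set(), []
    for sim, i, j in pairs:
        if i in used_ref or j in used_pert:
            continue
        used_ref.add(i)
        used_pert.add(j)
        sims.append(sim)
    return 1.0 - sum(sims) / len(sims)
```

Under these assumptions, one would fit a model on the full corpus and again on `drop_documents(corpus, 0.05)`, then pass the two component matrices to `model_distortion`. Repeating this over several seeds for both LDA and PCA gives a distribution of distortions per method that can be compared.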
Abstract
Using the 6,638 case descriptions of societal impact submitted for evaluation in the Research Excellence Framework (REF 2014), we replicate the topic model (Latent Dirichlet Allocation or LDA) made in this context and compare the results with factor-analytic results using a traditional word-document matrix (Principal Component Analysis or PCA). Removing a small fraction of documents from the sample, for example, has on average a much larger impact on LDA than on PCA-based models to the extent that the largest distortion in the case of PCA has less effect than the smallest distortion of LDA-based models. In terms of semantic coherence, however, LDA models outperform PCA-based models. The topic models inform us about the statistical properties of the document sets under study, but the results are statistical and should not be used for a semantic interpretation - for example, in grant…
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Scientometrics and Bibliometrics Research
Methods: Latent Dirichlet Allocation
