No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Marília Costa Rosendo Silva, Felipe Alves Siqueira, João Pedro Mantovani Tarrega, João Vitor Pataca Beinotti, Augusto Sousa Nunes, Miguel de Mattos Gardini, Vinícius Adolfo Pereira da Silva, Nádia Félix Felipe da Silva

TL;DR
This survey reviews the reproducibility and distortion issues in text clustering and topic modeling, highlighting the impact of initialization, outliers, and anomalies on unsupervised learning results.
Contribution
It provides a systematic literature review from 2011 to 2022, clarifies terminology, and discusses research opportunities and open issues in the field.
Findings
Reproducibility issues stem from initialization variability.
Outliers and anomalies significantly distort clustering outcomes.
The survey identifies gaps and future research directions.
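As an illustration of the first two findings, the following minimal sketch runs a plain Lloyd's k-means from several random initializations on synthetic 2-D data containing one injected outlier. It is not taken from the survey: the `kmeans` helper, the synthetic points, and the seed range are all assumptions chosen for illustration. Comparing the printed within-cluster sums of squares across seeds shows whether the solution depends on the initialization.

```python
import random


def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's k-means; initial centroids are k data points drawn at random.

    Hypothetical helper for illustration only, not the survey's method.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: mean of each cluster; an empty cluster keeps its centroid.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    # Within-cluster sum of squared distances for the final assignment.
    inertia = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, inertia


# Synthetic data (assumed): three Gaussian blobs plus one far-away outlier.
data_rng = random.Random(0)
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (5, 0), (0, 5)]
    for _ in range(30)
]
points.append((40.0, 40.0))  # injected outlier

# Same data, same k, different initializations.
results = {seed: kmeans(points, k=3, seed=seed) for seed in range(10)}
for seed, (_, inertia) in results.items():
    print(f"seed={seed} inertia={inertia:.2f}")
```

Fixing and reporting the seed makes a single run repeatable, but it does not remove the underlying sensitivity, so reporting results over several initializations is the more informative practice.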
Abstract
Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since…
Taxonomy
Topics: Advanced Text Analysis Techniques · Computational and Text Analysis Methods · Topic Modeling
