What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)
Amritanshu Agrawal, Wei Fu, Tim Menzies

TL;DR
This paper identifies the instability problem in standard LDA topic modeling across different datasets and proposes LDADE, a search-based parameter tuning method, to improve stability and classification performance.
Contribution
The paper introduces LDADE, a novel search-based tuning approach that significantly reduces LDA's topic instability and enhances text mining accuracy in software engineering contexts.
Findings
LDA exhibits high topic instability without tuning.
LDADE dramatically reduces cluster instability.
LDADE improves classification accuracy.
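One way to make "topic instability" concrete is to compare the top terms of topics across repeated runs on shuffled data. The sketch below is an illustration under stated assumptions, not the paper's exact metric: the function names, the Jaccard overlap, and the toy runs are all hypothetical.

```python
from statistics import median

def jaccard(a, b):
    """Jaccard overlap between two collections of terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def topic_stability(runs, top_n=5):
    """Median best-match overlap of top-n terms across all pairs of runs.

    runs: list of runs; each run is a list of topics; each topic is a
    ranked list of terms. This scoring rule is an assumption made for
    illustration only.
    """
    scores = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            for topic in runs[i]:
                # Match each topic to its most similar topic in the other run.
                best = max(jaccard(topic[:top_n], other[:top_n]) for other in runs[j])
                scores.append(best)
    return median(scores)

# Hypothetical topics from three runs over differently shuffled data.
run_a = [["bug", "crash", "error"], ["ui", "button", "click"]]
run_b = [["ui", "click", "button"], ["crash", "bug", "error"]]
run_c = [["bug", "ui", "test"], ["error", "click", "build"]]

stable = topic_stability([run_a, run_b], top_n=3)
unstable = topic_stability([run_a, run_b, run_c], top_n=3)
```

A score near 1 means the same topics reappear regardless of input order; a low score means the topics change from run to run.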
Abstract
Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet Allocation (LDA). When run on different datasets, LDA suffers from "order effects", i.e., different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stackoverflow), title and abstract text of thousands of Software…
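The tuning idea in the Method statement can be sketched as a small Differential Evolution loop that searches a parameter space for the setting with the highest stability score. This is a minimal sketch, not the authors' implementation. It assumes the tuned parameters are the number of topics `k` and the Dirichlet priors `alpha` and `beta`, and it replaces the real objective (running LDA and measuring stability) with a hypothetical `toy_stability` function.

```python
import random

# Assumed search space: (k, alpha, beta). These ranges are illustrative only.
BOUNDS = [(10, 100), (0.0, 1.0), (0.0, 1.0)]

def toy_stability(params):
    """Hypothetical stand-in for 'run LDA with params, return stability'.

    Peaks at k=20, alpha=0.1, beta=0.01; a real objective would train LDA
    on shuffled data and score topic overlap across runs.
    """
    k, alpha, beta = params
    return -(((k - 20) / 90) ** 2 + (alpha - 0.1) ** 2 + (beta - 0.01) ** 2)

def differential_evolution(objective, bounds, pop_size=10, f=0.7, cr=0.3,
                           generations=30, seed=0):
    """DE/rand/1/bin that maximizes `objective` inside `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than the current one.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            forced = rng.randrange(dim)  # at least one dimension always mutates
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = a[d] + f * (b[d] - c[d])
                    value = min(max(value, lo), hi)  # clip to the search space
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score >= scores[i]:  # keep the trial only if it is no worse
                pop[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

best_params, best_score = differential_evolution(toy_stability, BOUNDS)
```

In a real setting `k` would be rounded to an integer before training, and each objective call would be expensive, which is why small populations and few generations are attractive.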
Taxonomy
Methods: Latent Dirichlet Allocation
