Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Christoph Treude, Markus Wagner

TL;DR
This paper studies how to configure LDA well for topic modeling on GitHub and Stack Overflow data. It shows that different corpora need different parameter settings and proposes a method to predict good configurations for new datasets.
Contribution
It provides a broad analysis of LDA parameter settings for software repository texts and introduces an approach for predicting good configurations on unseen corpora.
Findings
Popular LDA parameter rules are not universally applicable.
GitHub and Stack Overflow corpora require different configurations.
Good configurations can be reliably predicted for new corpora.
Abstract
Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance…
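To make concrete what "multiple parameters" means for LDA, the following is a minimal sketch of the standard LDA generative process, not the tuning or prediction method from the paper. It assumes the three hyperparameters usually discussed for LDA: the number of topics, the document-topic prior alpha, and the topic-word prior beta. The function names, the toy vocabulary, and all values are hypothetical and chosen only for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_topics, alpha, beta, vocab, num_docs, doc_length, seed=0):
    """Generate a toy corpus from the LDA generative model (illustrative only)."""
    rng = random.Random(seed)
    # One word distribution per topic; beta controls how peaked each one is.
    topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # One topic mixture per document; alpha controls how many topics it blends.
        doc_topic = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            topic = rng.choices(range(num_topics), weights=doc_topic)[0]
            words.append(rng.choices(vocab, weights=topic_word[topic])[0])
        docs.append(words)
    return docs


if __name__ == "__main__":
    # Hypothetical vocabulary and parameter values.
    vocab = ["git", "merge", "java", "null", "css", "html"]
    for doc in generate_corpus(num_topics=3, alpha=0.5, beta=0.5,
                               vocab=vocab, num_docs=4, doc_length=10):
        print(" ".join(doc))
```

Fitting LDA runs this process in reverse: given only the documents, it estimates the topic mixtures and word distributions. Each of the three hyperparameters changes which solution is found, which is why a configuration that suits one corpus may not suit another.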
