Topic Stability over Noisy Sources
Jing Su, Oisín Boydell, Derek Greene, Gerard Lynch

TL;DR
This paper investigates how different types of textual noise affect the stability of various topic models, providing guidelines for corpus creation and model selection in noisy data scenarios.
Contribution
It offers a comprehensive analysis of noise impacts on topic stability and proposes practical guidelines for corpus generation and model selection in noisy environments.
Findings
Different types of textual noise affect the stability of different topic models in different ways.
Guidelines are proposed for generating text corpora, with a focus on automatic speech transcription.
Topic model selection methods are suggested for noisy corpora.
Abstract
Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.
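To make the idea of topic stability under noise concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not specify the noise models or the stability measure, so the word-substitution noise, the Jaccard-based agreement score, and the function names `add_word_noise`, `jaccard`, and `topic_agreement` are all assumptions chosen for illustration. The sketch compares the top-term lists of two sets of topics (for example, one fitted on clean text and one on noisy text) and reports how well they agree under the best one-to-one matching.

```python
import random
from itertools import permutations


def add_word_noise(tokens, rate, vocab, rng):
    """Hypothetical noise model: replace each token with a random
    vocabulary word with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topic_agreement(topics_a, topics_b):
    """Mean Jaccard similarity of top-term lists under the best
    one-to-one matching of topics (brute force, so small k only)."""
    k = len(topics_a)
    if k == 0 or k != len(topics_b):
        raise ValueError("both models need the same non-zero number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[j])
                    for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


# Illustrative top terms only; these are made-up values, not results.
clean_topics = [["goal", "match", "team"], ["vote", "party", "election"]]
noisy_topics = [["vote", "party", "elektion"], ["goal", "match", "team"]]
print(topic_agreement(clean_topics, noisy_topics))
```

In this toy input the topics come back in a different order and one term is corrupted, so the score falls below 1.0 even though the underlying themes are the same. A score near 1.0 would suggest that the model's topics are largely unaffected by the noise.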
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Latent Dirichlet Allocation
