Topic Stability over Noisy Sources

Jing Su; Oisín Boydell; Derek Greene; Gerard Lynch

arXiv:1508.01067 · cs.CL · August 6, 2015 · 1 citation

Open Access

TL;DR

This paper investigates how different types of textual noise affect the stability of various topic models, providing guidelines for corpus creation and model selection in noisy data scenarios.

Contribution

It offers a comprehensive analysis of noise impacts on topic stability and proposes practical guidelines for corpus generation and model selection in noisy environments.

Findings

01

Different noise types impact topic stability diversely.

02

Guidelines for generating cleaner corpora are proposed.

03

Recommendations are given for selecting topic models that remain robust on noisy data.

Abstract

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Latent Dirichlet Allocation