CAST: Corpus-Aware Self-similarity Enhanced Topic modelling
Yanan Ma, Chenghao Xiao, Chenhan Yuan, Sabine N van der Veer, Lamiece Hassan, Chenghua Lin, Goran Nenadic

TL;DR
CAST introduces a corpus-aware self-similarity approach that improves neural topic modelling by filtering out non-topical words, yielding more coherent and diverse topics, especially on noisy datasets.
Contribution
It proposes a new method that utilizes dataset-contextualized embeddings and self-similarity metrics to enhance topic quality and robustness against noise.
Findings
Outperforms strong baselines on news and Twitter datasets.
Improves topic coherence and diversity.
Effectively filters out functional and noisy words.
Abstract
Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents, while ignoring contextual details of candidate centroid words, leading to the inaccurate selection of topic words due to the contextualization gap. In parallel, it is found that functional words are frequently selected over topical words. To address these limitations, we introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method that builds upon candidate centroid word embeddings contextualized on the dataset, and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower…
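The self-similarity idea in the abstract can be illustrated with a minimal sketch. It assumes that a word's self-similarity is the mean pairwise cosine similarity of its contextual embeddings across different contexts, and that candidates scoring below a cutoff are dropped. The abstract is truncated, so the exact metric, the threshold value, and the function names `self_similarity` and `filter_candidates` are hypothetical and not taken from the paper. The embeddings here are toy vectors standing in for real encoder output.

```python
import numpy as np


def self_similarity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity among the rows of `embeddings`.

    Each row is assumed to be the embedding of the same word in a
    different context. Returns 1.0 when fewer than two contexts exist.
    """
    n = embeddings.shape[0]
    if n < 2:
        return 1.0
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Average over off-diagonal entries only (exclude each row with itself).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def filter_candidates(contextual: dict, threshold: float) -> list:
    """Keep words whose self-similarity is at least `threshold` (assumed cutoff)."""
    return [w for w, emb in contextual.items() if self_similarity(emb) >= threshold]


dim = 16
contextual = {
    # Toy "topical" word: embeddings stay close to one shared direction.
    "vaccine": np.ones((5, dim)) + 0.1 * np.eye(5, dim),
    # Toy "functional" word: embeddings point in unrelated directions.
    "the": np.eye(5, dim),
}

kept = filter_candidates(contextual, threshold=0.5)
print(kept)
```

In this toy setup the stable word is kept and the context-dependent word is filtered out. With a real encoder, the rows would come from the same token observed in different documents of the corpus.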
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Contrastive Learning
