CAST: Corpus-Aware Self-similarity Enhanced Topic modelling

Yanan Ma; Chenghao Xiao; Chenhan Yuan; Sabine N van der Veer; Lamiece Hassan; Chenghua Lin; Goran Nenadic

arXiv:2410.15136·cs.CL·February 7, 2025·2 cites

TL;DR

CAST introduces a novel corpus-aware self-similarity approach to improve neural topic modeling by filtering out non-topical words, resulting in more coherent and diverse topics, especially in noisy datasets.

Contribution

The paper proposes a method that uses dataset-contextualized embeddings and self-similarity metrics to improve topic quality and robustness against noise.

Findings

01. Outperforms strong baselines on news and Twitter datasets.
02. Improves topic coherence and diversity.
03. Effectively filters out functional and noisy words.

Abstract

Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents, while ignoring contextual details of candidate centroid words, leading to the inaccurate selection of topic words due to the contextualization gap. In parallel, it is found that functional words are frequently selected over topical words. To address these limitations, we introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method that builds upon candidate centroid word embeddings contextualized on the dataset, and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower…
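The self-similarity idea in the abstract can be illustrated with a minimal sketch. It assumes self-similarity means the mean pairwise cosine similarity of one word's contextualized embeddings across its occurrences, and that candidates scoring below a threshold are dropped. The function names, the threshold value, and the toy 2-D vectors are all hypothetical and are not taken from the paper; the truncated abstract does not specify the exact scoring or filtering rule.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(context_vectors):
    """Mean pairwise cosine similarity of one word's embeddings across contexts."""
    pairs = list(combinations(context_vectors, 2))
    if not pairs:
        # Assumption: a word seen in a single context is treated as fully self-similar.
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def filter_candidates(embeddings_by_word, threshold):
    """Keep candidate words whose self-similarity is at or above the threshold."""
    return [
        word
        for word, vectors in embeddings_by_word.items()
        if self_similarity(vectors) >= threshold
    ]


# Toy 2-D vectors standing in for contextualized embeddings (invented for illustration).
toy = {
    "vaccine": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],   # stable across contexts
    "the": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],      # varies with context
}

kept = filter_candidates(toy, threshold=0.5)  # hypothetical threshold
print(kept)
```

In practice the per-occurrence vectors would come from an encoder run over the corpus, and the threshold would need tuning per dataset; both are left out of this sketch.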

Taxonomy

Topics: Computational and Text Analysis Methods

Methods: Contrastive Learning