ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Tommaso Fornaciari; Dirk Hovy; Federico Bianchi

arXiv:2210.14763·cs.CL·October 27, 2022

TL;DR

ProSiT is a deterministic, flexible method for discovering latent document dimensions. It determines the number of topics itself rather than requiring it in advance, and it matches or outperforms traditional topic models and clustering methods on coherence and distinctiveness.

Contribution

ProSiT introduces an interpretable, deterministic approach that finds the number of latent dimensions without a priori selection, is agnostic to the input format, and requires only two hyper-parameters, which can be set via grid search.

Findings

01 ProSiT matches or outperforms existing topic models and clustering methods on coherence and distinctiveness.

02 It produces replicable, deterministic results across the benchmark datasets.

03 The method is agnostic to the input format and has only two hyper-parameters, which can be tuned via grid search.

Abstract

The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: e.g., they require us to choose the number of latent dimensions a priori, and the results are stochastic. Most clustering methods have the same issues and lack flexibility in various ways, such as not accounting for the influence of different topics on single documents, forcing word-descriptors to belong to a single topic (hard-clustering) or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds - ProSiT, a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and only has two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data…
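The abstract notes that the two hyper-parameters can be set via grid search. The sketch below illustrates only that general idea in Python; it does not use the ProSiT code. The names `alpha`, `beta`, `grid_search`, and `toy_score` are hypothetical, and `toy_score` is a placeholder for whatever evaluation metric (for example, topic coherence) one would actually compute.

```python
from itertools import product


def grid_search(score_fn, alphas, betas):
    """Evaluate score_fn on every (alpha, beta) pair and return the best.

    Returns a tuple (best_score, best_alpha, best_beta). Ties keep the
    first pair encountered, so the result is deterministic.
    """
    best = None
    for alpha, beta in product(alphas, betas):
        score = score_fn(alpha, beta)
        if best is None or score > best[0]:
            best = (score, alpha, beta)
    return best


def toy_score(alpha, beta):
    # Hypothetical placeholder objective that peaks at alpha=0.5, beta=2.
    # In practice this would fit the model and return a quality metric.
    return -((alpha - 0.5) ** 2) - ((beta - 2) ** 2)


best_score, best_alpha, best_beta = grid_search(
    toy_score, alphas=[0.1, 0.5, 0.9], betas=[1, 2, 3]
)
print(best_alpha, best_beta)
```

With only two hyper-parameters the grid stays small: three candidate values each gives nine evaluations, and because the method is described as deterministic, each configuration would need to be run only once.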

Code & Models

Repositories

fornaciari/prosit (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies