Are Classes Clusters?

Kees Varekamp

arXiv:2104.07840 · cs.CL · April 19, 2021

TL;DR

This paper evaluates whether recent sentence embedding models naturally form meaningful clusters corresponding to topic classes, revealing that unsupervised clustering can partially reconstruct class labels in real-world datasets.

Contribution

It provides an empirical analysis of how well unsupervised clustering of embeddings from four sentence embedding models recovers topic classes.

Findings

01. Clustering sentence embeddings partially recovers topic classes.

02. The resulting cluster-based classifier performs better than random chance, though far from perfectly.

03. Sentence embedding models show potential for unsupervised topic discovery.

Abstract

Sentence embedding models aim to provide general-purpose embeddings for sentences. Most of the models studied in this paper claim to perform well on semantic textual similarity (STS) tasks, but they do not report on their suitability for clustering. This paper looks at four recent sentence embedding models: Universal Sentence Encoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER (Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020). It gives a brief overview of the ideas behind their implementations. It then investigates how well topic classes in two text classification datasets (Amazon Reviews (Ni et al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their corresponding sentence embedding space. While the performance of the resulting classification model is far from perfect, it is better than random. This is interesting because the classification model has been…
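
The question the abstract poses, whether topic classes map to clusters in embedding space, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the paper's procedure: the clustering algorithm (a plain k-means), the majority-vote mapping from clusters to classes, the hypothetical class names "books" and "electronics", and the synthetic 2-D points that stand in for real sentence embeddings.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def majority_vote_accuracy(assignments, labels):
    """Label each cluster with its most common true class, then score the result."""
    per_cluster = {}
    for cluster, label in zip(assignments, labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return correct / len(labels)


# Synthetic stand-in for sentence embeddings: two well-separated 2-D blobs,
# one per hypothetical topic class. Real embeddings would come from a model.
rng = random.Random(42)
centers = {"books": (0.0, 0.0), "electronics": (10.0, 10.0)}
points, labels = [], []
for label, (cx, cy) in centers.items():
    for _ in range(50):
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        labels.append(label)

assignments = kmeans(points, k=len(centers))
accuracy = majority_vote_accuracy(assignments, labels)
chance = 1 / len(centers)  # baseline for balanced classes
print(f"cluster-to-class accuracy: {accuracy:.2f} (chance: {chance:.2f})")
```

Note that majority-vote accuracy can never fall below the share of the largest class, so on real data it is worth comparing against that baseline, or against clusters of randomly shuffled labels, before reading anything into the score.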

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining

Methods: DeCLUTR