Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences

Yuan An; Alexander Kalinowski; Jane Greenberg

arXiv:2110.00697·cs.CL·October 5, 2021

TL;DR

This paper investigates the latent structure of sentence embeddings through clustering and network analysis, examining how sentence length and structure affect the embedding space and its topology, and identifying which embedding method yields the most clusterable embeddings.

Contribution

The paper reports a set of comprehensive clustering and network analyses of sentence and sub-sentence embedding spaces, shows how sentence structure affects clustering properties, and draws implications for future embedding models.

Findings

01. In general, embeddings of span sub-sentences have better clustering properties than embeddings of the original sentences.

02. One embedding method generates the most clusterable embeddings.

03. The results have implications for future sentence embedding models and applications.

Abstract

Sentence embedding methods offer a powerful approach for working with short textual constructs or sequences of words. By representing sentences as dense numerical vectors, many natural language processing (NLP) applications have improved their performance. However, relatively little is understood about the latent structure of sentence embeddings. Specifically, research has not addressed whether the length and structure of sentences impact the sentence embedding space and topology. This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces. Results show that one method generates the most clusterable embeddings. In general, the embeddings of span sub-sentences have better clustering properties than the original sentences. The results have implications for future sentence embedding models and applications.
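
As a rough illustration of what "clusterable" can mean in practice, the sketch below computes a mean silhouette coefficient over a handful of vectors. This is an assumption for illustration only: the abstract does not say which clustering metrics the paper uses, and the vectors, labels, and function names here (`cosine_distance`, `mean_silhouette`) are hypothetical stand-ins rather than real sentence embeddings or the authors' code.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_silhouette(vectors, labels, dist=cosine_distance):
    """Mean silhouette coefficient over all points, in the range [-1, 1]."""
    scores = []
    for i, (x, label) in enumerate(zip(vectors, labels)):
        # Group distances from point i to every other point by cluster label.
        by_cluster = {}
        for j, (y, other) in enumerate(zip(vectors, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(x, y))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            # Convention: singleton clusters (or a single cluster) score 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 2-D vectors standing in for embeddings (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
well_separated = mean_silhouette(toy_vectors, [0, 0, 1, 1])
poorly_separated = mean_silhouette(toy_vectors, [0, 1, 0, 1])
print(well_separated > poorly_separated)
```

Under this measure, a labeling that matches the geometry of the vectors scores close to 1, while a labeling that cuts across it scores below 0, so a higher score for one embedding method than another on the same cluster assignment procedure would be one way to read "more clusterable".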

Code & Models

Repositories

sent-subsent-embs/clustering-network-analysis
PyTorch · Official