TL;DR
This paper investigates the latent structure of sentence embeddings through clustering and network analysis, revealing how sentence length and structure influence embedding space topology and identifying the most clusterable embedding methods.
Contribution
It provides the first comprehensive analysis of sentence and sub-sentence embedding spaces, showing how sentence structure affects clustering properties and offering insights for future embedding models.
Findings
Sub-sentence embeddings exhibit better clustering than full sentences.
One embedding method produces the most clusterable representations.
Results inform future development of sentence embedding techniques.
Abstract
Sentence embedding methods offer a powerful approach for working with short textual constructs or sequences of words. By representing sentences as dense numerical vectors, many natural language processing (NLP) applications have improved their performance. However, relatively little is understood about the latent structure of sentence embeddings. Specifically, research has not addressed whether the length and structure of sentences impact the sentence embedding space and topology. This paper reports research on a set of comprehensive clustering and network analyses targeting sentence and sub-sentence embedding spaces. Results show that one method generates the most clusterable embeddings. In general, the embeddings of span sub-sentences have better clustering properties than the original sentences. The results have implications for future sentence embedding models and applications.
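To make "clusterable" concrete, here is a minimal sketch of one common way to score how well a set of vectors separates into clusters: the mean silhouette coefficient. The abstract does not say which clustering metrics the paper uses, so treat the metric as an assumption. The function names and the 2-D points are invented for illustration and are not real sentence embeddings or results from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(vectors, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels.

    Assumption: this is one standard clusterability score, not necessarily
    the one used in the paper.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(euclidean(vectors[i], vectors[j]) for j in own) / len(own)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(euclidean(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Invented toy points (hypothetical, NOT real embeddings): two well-separated
# groups versus two groups that nearly touch.
labels = [0, 0, 1, 1]
compact_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
diffuse_points = [(0.0, 0.0), (4.0, 0.0), (6.0, 0.0), (10.0, 0.0)]

compact_score = silhouette(compact_points, labels)
diffuse_score = silhouette(diffuse_points, labels)
print(compact_score > diffuse_score)
```

A score close to 1 means points sit much closer to their own cluster than to any other, and a score near 0 means the clusters overlap. In an analysis like the one described above, the same score could be computed once on full-sentence vectors and once on sub-sentence vectors, with labels from whatever clustering algorithm is chosen, and the two values compared.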
