Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model
Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao

TL;DR
This paper investigates the performance gap between supervised and unsupervised sentence representation learning, identifying similarity pattern complexity as a key factor, and proposes using large language models to generate complex training data to narrow this gap.
Contribution
It introduces the Relative Fitting Difficulty metric and leverages large language models to generate complex data patterns, effectively reducing the performance gap in sentence embedding learning.
Findings
Similarity pattern complexity influences performance gap
Using LLM-generated data with hierarchical patterns narrows the gap
The proposed method improves unsupervised CSE performance
Abstract
Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings (CSE) being the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, with their only difference lying in the training data. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, since alignment and uniformity only measure the results, they fail to answer "What aspects of the training data contribute to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and…
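The abstract names alignment and uniformity without defining them. As a rough illustration, the sketch below computes both using the definitions commonly used in the contrastive learning literature: alignment as the mean squared distance between positive pairs, and uniformity as the log of the mean Gaussian potential over all distinct pairs. That the paper uses exactly these forms is an assumption, and the function names, the exponents `alpha=2` and `t=2`, and the random toy embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(x, y, alpha=2):
    """Mean distance**alpha between paired rows of x and y (lower = closer positives)."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))


def uniformity(x, t=2):
    """Log of the mean Gaussian potential over distinct pairs (lower = more spread out)."""
    n = x.shape[0]
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(n, k=1)  # each unordered pair once, no self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


# Toy data standing in for sentence embeddings (purely illustrative).
rng = np.random.default_rng(0)
anchors = l2_normalize(rng.normal(size=(64, 16)))
positives = l2_normalize(anchors + 0.1 * rng.normal(size=(64, 16)))

align_score = alignment(anchors, positives)
uniform_score = uniformity(anchors)
```

Both quantities are computed on the learned embeddings, which is why, as the abstract notes, they describe the outcome of training and say nothing directly about the training data.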
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
Methods: Contrastive Learning
