Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model

Mingxin Li; Richong Zhang; Zhijie Nie; Yongyi Mao

arXiv:2309.06453·cs.CL·December 20, 2023

TL;DR

This paper investigates the performance gap between supervised and unsupervised sentence representation learning, identifying similarity pattern complexity as a key factor, and proposes using large language models to generate complex training data to narrow this gap.

Contribution

It introduces the Relative Fitting Difficulty metric and leverages large language models to generate complex data patterns, effectively reducing the performance gap in sentence embedding learning.

Findings

01. Similarity pattern complexity influences the performance gap

02. Using LLM-generated data with hierarchical patterns narrows the gap

03. The proposed method improves unsupervised CSE performance

Abstract

Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings (CSE) being the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, with their only difference lying in the training data. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, since alignment and uniformity only measure the results, they fail to answer "What aspects of the training data contribute to the performance gap?" and "How can the performance gap be narrowed?" In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and…
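For readers unfamiliar with the two representation properties named in the abstract, the following is a minimal sketch of how alignment and uniformity are commonly computed in the contrastive learning literature. It assumes the widely used definitions (mean squared distance between L2-normalized positive pairs, and the log of the mean Gaussian potential over all embedding pairs with temperature t = 2). The function names and toy vectors are illustrative only and are not taken from the paper or its repository.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(positive_pairs):
    """Mean squared distance between normalized embeddings of positive pairs.

    Lower values mean positive pairs sit closer together.
    """
    dists = [sq_dist(l2_normalize(u), l2_normalize(v)) for u, v in positive_pairs]
    return sum(dists) / len(dists)


def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct embedding pairs.

    Lower values mean embeddings are spread more evenly on the unit sphere.
    The default t=2.0 is an assumed, commonly used temperature.
    """
    normed = [l2_normalize(e) for e in embeddings]
    potentials = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(normed, 2)]
    return math.log(sum(potentials) / len(potentials))


if __name__ == "__main__":
    # Toy 2-D "sentence embeddings" (hypothetical values for illustration).
    pairs = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.1, 0.9])]
    points = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    print("alignment:", alignment(pairs))
    print("uniformity:", uniformity(points))
```

Both quantities are computed on a trained encoder's outputs, which is why the abstract describes them as measuring results rather than explaining which properties of the training data cause the gap.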

Code & Models

Repositories

bdbc-kg-nlp/ngcse (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining

Methods: Contrastive Learning