Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning

Huiming Wang; Zhaodonghui Li; Liying Cheng; Soh De Wen; Lidong Bing

arXiv:2310.10962 · cs.CL · May 20, 2024 · 1 citation

TL;DR

This paper introduces MultiCSR, a multi-stage framework that refines LLM-generated sentence data for contrastive learning, significantly improving sentence representations even with less advanced models.

Contribution

The paper proposes a novel multi-level refinement process for LLM-generated data, enhancing contrastive sentence embedding training and achieving state-of-the-art results.

Findings

1. MultiCSR improves sentence embedding quality with less advanced LLMs.
2. Refined data generation leads to better contrastive learning outcomes.
3. Applying MultiCSR to ChatGPT yields state-of-the-art performance.

Abstract

Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training…
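The abstract's point that contrastive learning is sensitive to the quality of sentence pairs can be illustrated with a small sketch. The code below is not the MultiCSR pipeline. It assumes a SimCSE-style supervised objective (in-batch positives plus hard negatives, cosine similarity, temperature 0.05), and it uses toy 2-D vectors in place of real encoder outputs. The names `cosine` and `contrastive_loss` are hypothetical helpers introduced only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, negatives, temperature=0.05):
    """Mean InfoNCE-style loss over a batch of (anchor, positive, negative) triplets.

    For anchor i, the target is positives[i]; every other positive and
    every negative in the batch acts as a distractor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in negatives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (assumed values, not real model outputs).
anchors = [[1.0, 0.0], [0.0, 1.0]]
clean_positives = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
noisy_positives = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor
negatives = [[-1.0, 0.0], [0.0, -1.0]]

clean_loss = contrastive_loss(anchors, clean_positives, negatives)
noisy_loss = contrastive_loss(anchors, noisy_positives, negatives)
print(f"clean pairs: {clean_loss:.4f}, noisy pairs: {noisy_loss:.4f}")
```

With well-matched pairs the loss is close to zero, while mismatched pairs produce a much larger loss and therefore a strong, misleading training signal. This is one way to see why poorly generated pairs could hurt the resulting embeddings.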

Code & Models

Repositories

circle-ming/multicsr
PyTorch · Official

Models

Datasets

leoner24/MultiCSR_NLI
dataset · 13 downloads

Videos

Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: SimCSE · Balanced Selection · Focus · Contrastive Learning