Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning
Huiming Wang, Zhaodonghui Li, Liying Cheng, Soh De Wen, Lidong Bing

TL;DR
This paper introduces MultiCSR, a multi-stage framework that refines LLM-generated sentence data for contrastive learning, significantly improving sentence representations even when the data comes from less advanced LLMs.
Contribution
The paper proposes a novel multi-level refinement process for LLM-generated data, enhancing contrastive sentence embedding training and achieving state-of-the-art results.
Findings
MultiCSR improves sentence embedding quality with less advanced LLMs.
Refined data generation leads to better contrastive learning outcomes.
Applying MultiCSR to ChatGPT yields state-of-the-art performance.
Abstract
Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training…
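The abstract's point that contrastive models are sensitive to the quality of sentence pairs can be illustrated with a small sketch. The code below is not the MultiCSR implementation; it is a generic InfoNCE-style loss with in-batch negatives, of the kind used by SimCSE-like models. The function names (`cosine`, `info_nce_loss`), the toy 2-dimensional "embeddings", and the temperature value are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Average InfoNCE loss over a batch.

    anchors[i] and positives[i] are assumed to embed a matching sentence
    pair; every positives[j] with j != i serves as an in-batch negative.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (hypothetical): two sentences with orthogonal meanings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
clean_positives = [[1.0, 0.0], [0.0, 1.0]]   # each pair is well matched
noisy_positives = [[0.0, 1.0], [1.0, 0.0]]   # each pair is mismatched

clean_loss = info_nce_loss(anchors, clean_positives)
noisy_loss = info_nce_loss(anchors, noisy_positives)
print(clean_loss < noisy_loss)
```

With well-matched pairs the loss is close to zero, while mismatched pairs produce a large loss and therefore a training signal that pushes embeddings in the wrong direction. This is one way to read the motivation for refining generated pairs before training.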
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: SimCSE · Balanced Selection · Focus · Contrastive Learning
