Sim-GPT: Text Similarity via GPT Annotated Data
Shuhe Wang, Beiming Cao, Shengyu Zhang, Xiaoya Li, Jiwei Li, Fei Wu, Guoyin Wang, Eduard Hovy

TL;DR
Sim-GPT leverages GPT-4 to generate high-quality labeled data for semantic textual similarity, enabling training of effective models that outperform existing methods on multiple benchmarks.
Contribution
The paper introduces a novel approach of using GPT-4 to generate annotated data for STS, significantly reducing reliance on costly human annotations and achieving state-of-the-art results.
Findings
Achieved SOTA performance on seven STS benchmarks.
Generated 371K annotated examples using GPT-4.
Training a BERT or RoBERTa model once on the generated data avoids invoking an LLM for every sentence pair, which saves cost and time.
Abstract
Due to the lack of a large collection of high-quality labeled sentence pairs with textual similarity scores, existing approaches for Semantic Textual Similarity (STS) mostly rely on unsupervised techniques or training signals that are only partially correlated with textual similarity, e.g., NLI-based datasets. To tackle this issue, in this paper, we propose the strategy of measuring text similarity via GPT annotated data (Sim-GPT for short). The core idea of Sim-GPT is to generate data with STS labels using GPT-4, based on which an STS model is trained. The Sim-GPT framework utilizes LLMs to provide a substantial amount of reliable annotated data, filling the gap left by the lack of training signals for STS. Sim-GPT is trained on a one-time generated dataset using BERT or RoBERTa as the backbone, which offers long-term savings in cost and speed compared to repeatedly invoking LLMs for each…
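The abstract describes training an STS model on GPT-4-labeled sentence pairs but, as excerpted here, does not state the scoring function or the loss. As a minimal sketch only, assume each sentence is mapped to an embedding by a BERT/RoBERTa-style encoder, the predicted similarity is the cosine of the two embeddings, and the model is fit by regressing that value onto GPT-4 scores rescaled to [0, 1]. The function names, the toy vectors, and the choice of mean squared error are all hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

def mse_against_labels(embedding_pairs, labels):
    """Mean squared error between predicted similarities and target labels."""
    preds = [cosine_similarity(u, v) for u, v in embedding_pairs]
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

# Hypothetical toy embeddings standing in for encoder outputs.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
# Hypothetical GPT-4 similarity labels, assumed rescaled to [0, 1].
labels = [1.0, 0.0]

loss = mse_against_labels(pairs, labels)
```

In a real setup the embeddings would come from the trainable encoder, and this loss (or whichever objective the paper actually uses) would be minimized by gradient descent over the generated dataset.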
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · Discriminative Fine-Tuning · Cosine Annealing · Linear Layer
