Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples

Soma Sato; Hayato Tsukagoshi; Ryohei Sasano; Koichi Takeda

arXiv:2402.15132 · cs.CL · August 5, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper improves sentence embeddings by prompting a large language model with a few examples to automatically generate an NLI training dataset, which is then used to fine-tune PromptEOL in place of a large manually annotated dataset.

Contribution

It introduces an approach that uses few-shot examples to automatically generate NLI training data for sentence embedding models, and explores how best to leverage those examples, so that fine-tuning does not depend on large manually annotated datasets.

Findings

01 Outperforms existing models on semantic textual similarity tasks

02 Effective automatic dataset generation with few-shot learning

03 Reduces dependency on manually annotated datasets

Abstract

Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL requires a manually annotated natural language inference (NLI) dataset for fine-tuning. We aim to improve sentence embeddings without using large manually annotated datasets by automatically generating an NLI dataset with an LLM and using it for fine-tuning of PromptEOL. To achieve this, we explore methods of data generation suitable for sentence embedding learning in this study. Specifically, we will focus on automatic dataset generation through few-shot learning and explore the appropriate methods to leverage few-shot examples. Experimental results on the STS tasks demonstrate that our…
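The few-shot generation step described in the abstract can be illustrated with a minimal sketch. It only assembles a prompt that asks an LLM to write a hypothesis for a new premise, given a handful of labeled premise/hypothesis pairs. The instruction wording, the `NLIExample` type, and the `build_few_shot_prompt` function are all hypothetical names chosen for illustration; they are not the prompts or code used by the authors, and no LLM is called here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NLIExample:
    """One few-shot example: a premise and a hypothesis for a fixed label."""
    premise: str
    hypothesis: str


# Hypothetical instruction text per NLI label (an assumption, not the paper's prompt).
INSTRUCTIONS = {
    "entailment": "Write a sentence that must be true given the premise.",
    "contradiction": "Write a sentence that must be false given the premise.",
}


def build_few_shot_prompt(label: str, shots: List[NLIExample], premise: str) -> str:
    """Assemble a few-shot prompt that ends where the LLM should continue."""
    if label not in INSTRUCTIONS:
        raise ValueError(f"unsupported label: {label}")
    lines = [INSTRUCTIONS[label], ""]
    for shot in shots:
        lines.append(f"Premise: {shot.premise}")
        lines.append(f"Hypothesis: {shot.hypothesis}")
        lines.append("")
    # The final block leaves the hypothesis open for the model to fill in.
    lines.append(f"Premise: {premise}")
    lines.append("Hypothesis:")
    return "\n".join(lines)


shots = [
    NLIExample("A man is playing a guitar on stage.", "A person is making music."),
    NLIExample("Two children are building a sandcastle.", "Kids are playing with sand."),
]
prompt = build_few_shot_prompt("entailment", shots, "A dog runs across a field.")
print(prompt)
```

In a pipeline of this kind, the model's completion for each unlabeled premise would be paired with that premise to form one generated NLI example. Repeating the process per label would yield the entailment and contradiction pairs typically used for fine-tuning a sentence embedding model.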

Code & Models

Repositories

lamsoma/auto_nli (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Focus