GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Aivin V. Solatorio

arXiv:2402.16829 · cs.LG · February 27, 2024 · 5 cites

TL;DR

GISTEmbed uses a guide model to select in-batch negatives during contrastive fine-tuning of text embedding models, reducing the noise that unsupervised triplet mining introduces and improving model performance.

Contribution

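The paper presents a strategy in which a guide model steers in-batch negative selection during contrastive training, rather than relying on random sampling and treating all batch negatives as equally useful. It outperforms random sampling and reaches state-of-the-art results in select categories.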

Findings

01. Consistent performance improvements across various models on the MTEB benchmark

02. State-of-the-art results in select categories

03. Enhanced training efficiency for smaller models

Abstract

Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding…
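
The idea of guide-model-based in-batch negative selection can be illustrated with a small sketch. The abstract does not spell out the selection rule, so the one used here is an assumption: a batch candidate is dropped from the contrastive loss for a query whenever the guide model scores it as more similar to that query than the query's labeled positive. The function names `kept_candidates` and `guided_info_nce`, the temperature value, and the use of plain Python lists of precomputed similarities are all hypothetical choices for illustration, not the paper's implementation.

```python
import math


def kept_candidates(guide_sim, i):
    """Indices of batch candidates retained for query i.

    guide_sim[i][j] is the guide model's similarity between query i and
    candidate j; candidate i is the labeled positive for query i.
    Assumed rule: drop any candidate the guide rates as MORE similar to the
    query than its labeled positive (a likely false negative).
    """
    threshold = guide_sim[i][i]
    return [
        j
        for j in range(len(guide_sim))
        if j == i or guide_sim[i][j] <= threshold
    ]


def guided_info_nce(model_sim, guide_sim, temperature=0.05):
    """Mean InfoNCE loss over the batch, computed on retained candidates only.

    model_sim[i][j] is the similarity under the model being fine-tuned.
    """
    n = len(model_sim)
    total = 0.0
    for i in range(n):
        kept = kept_candidates(guide_sim, i)
        logits = [model_sim[i][j] / temperature for j in kept]
        # Numerically stable log-sum-exp over the retained candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - model_sim[i][i] / temperature
    return total / n


if __name__ == "__main__":
    # Toy batch of three (query, positive) pairs with made-up similarities.
    guide = [[0.90, 0.95, 0.10], [0.20, 0.80, 0.30], [0.10, 0.20, 0.70]]
    model = [[0.60, 0.70, 0.10], [0.20, 0.80, 0.10], [0.00, 0.10, 0.90]]
    print(kept_candidates(guide, 0))
    print(guided_info_nce(model, guide))
```

In the toy batch, the guide rates candidate 1 as closer to query 0 than query 0's own positive, so that candidate is excluded from query 0's denominator instead of being pushed away as a negative. With random in-batch negatives, the same candidate would contribute to the loss like any other.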

Code & Models

Repositories

avsolatorio/gistembed (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques