GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
Aivin V. Solatorio

TL;DR
GISTEmbed introduces a guided in-sample negative selection method for contrastive training of text embeddings, significantly improving model performance and reducing noise compared to traditional unsupervised approaches.
Contribution
The paper presents a strategy that uses a guide model to select in-batch negatives during contrastive training, instead of relying on random sampling and treating all batch negatives as equally useful. It reports state-of-the-art results in select MTEB categories.
Findings
Consistent performance improvements across various models on the MTEB benchmark
State-of-the-art results in select categories
Enhanced training efficiency for smaller models
Abstract
Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding…
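The guided selection of in-batch negatives described in the abstract can be illustrated with a small sketch. It rests on an assumption that is not spelled out in the text above: the guide model is taken to reject any in-batch candidate that it scores as more similar to the query than the query's own labelled positive, since such a candidate is likely a false negative. The function names, the toy similarity matrices, and the temperature value are all hypothetical, and the paper's actual procedure may differ in its details.

```python
import math

def guided_mask(guide_sim):
    """mask[i][j] is True when candidate j may serve as a negative for query i.

    guide_sim[i][j] is the guide model's similarity between query i and the
    positive of pair j. Assumed rule: drop candidates the guide rates above
    the labelled (query, positive) pair itself.
    """
    n = len(guide_sim)
    mask = []
    for i in range(n):
        threshold = guide_sim[i][i]  # guide score of the labelled pair
        mask.append([j != i and guide_sim[i][j] <= threshold for j in range(n)])
    return mask

def guided_info_nce(student_sim, guide_sim, temperature=0.05):
    """Mean InfoNCE loss over the batch, skipping negatives the guide rejects."""
    mask = guided_mask(guide_sim)
    losses = []
    for i, row in enumerate(student_sim):
        pos = math.exp(row[i] / temperature)
        neg = sum(math.exp(s / temperature)
                  for j, s in enumerate(row) if mask[i][j])
        losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses)

# Toy batch of three (query, positive) pairs; the values are made up.
guide_sim = [
    [0.90, 0.95, 0.10],  # the guide rates candidate 1 above query 0's positive
    [0.20, 0.85, 0.15],
    [0.05, 0.10, 0.80],
]
student_sim = [
    [0.70, 0.65, 0.10],
    [0.20, 0.60, 0.10],
    [0.00, 0.10, 0.50],
]

# A guide that rejects nothing reproduces the usual in-batch loss.
permissive_guide = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(guided_mask(guide_sim)[0])
print(guided_info_nce(student_sim, guide_sim))
print(guided_info_nce(student_sim, permissive_guide))
```

In this toy batch, candidate 1 is excluded from the negatives of query 0, so the model being fine-tuned is not penalised for placing that candidate close to the query. Every other in-batch candidate still contributes to the loss as usual.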
Code & Models
- 🤗 avsolatorio/GIST-Embedding-v0 · model · 134k dl · ♡ 30
- 🤗 avsolatorio/GIST-all-MiniLM-L6-v2 · model · 19k dl · ♡ 10
- 🤗 avsolatorio/GIST-small-Embedding-v0 · model · 141k dl · ♡ 33
- 🤗 avsolatorio/GIST-large-Embedding-v0 · model · 3.4k dl · ♡ 18
- 🤗 ChristianAzinn/gist-embedding-v0-gguf · model · 203 dl
- 🤗 ChristianAzinn/gist-large-embedding-v0-gguf · model · 195 dl
- 🤗 ChristianAzinn/gist-small-embedding-v0-gguf · model · 263 dl
- 🤗 jncraton/GIST-small-Embedding-v0-ct2-int8 · model · 33 dl
- 🤗 tomaarsen/distilroberta-base-nli-v3 · model · 1 dl
- 🤗 avsolatorio/60-600-13-1-4-2-0-0-384-512-cm_NOI_avsolatorio_GIST-all-MiniLM-L6-v2-20240511003225-best · model · 76 dl
- avsolatorio/mteb-amazon_massive_scenario-avs_triplets · dataset · 5 dl
- avsolatorio/mteb-amazon_massive_intent-avs_triplets · dataset · 7 dl
- avsolatorio/mteb-banking77-avs_triplets · dataset · 8 dl
- avsolatorio/mteb-emotion-avs_triplets · dataset · 15 dl
- avsolatorio/mteb-amazon_counterfactual-avs_triplets · dataset · 5 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
