Towards General Text Embeddings with Multi-stage Contrastive Learning

Zehan Li; Xin Zhang; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Meishan Zhang

arXiv:2308.03281·cs.CL·August 8, 2023·58 cites

Open Access · 10 Models · 1 Dataset

TL;DR

GTE is a general-purpose text embedding model trained with multi-stage contrastive learning on a large, diverse mixture of datasets. With a modest 110M parameters, the base model performs strongly on both NLP and code retrieval tasks.

Contribution

The paper introduces GTE, a unified text embedding model trained with multi-stage contrastive learning on diverse datasets, outperforming larger models without task-specific fine-tuning.

Findings

01 GTE-base, with 110M parameters, outperforms OpenAI's black-box embedding API.

02 GTE-base surpasses text embedding models 10x its size on the Massive Text Embedding Benchmark (MTEB).

03 The model handles code as text and needs no additional fine-tuning for each programming language.

Abstract

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the amount of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_{\text{base}}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of…
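As a rough illustration of the contrastive objective the abstract refers to, the sketch below computes a standard in-batch InfoNCE loss over (query, document) embedding pairs. This is a generic formulation and not necessarily the exact loss used in the paper. The function names, the toy vectors, and the temperature value of 0.05 are assumptions made for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(queries, documents, temperature=0.05):
    """Generic InfoNCE loss with in-batch negatives (illustrative only).

    queries[i] and documents[i] form a positive pair; every other
    document in the batch serves as a negative for queries[i].
    The temperature default is an assumed value, not taken from the paper.
    """
    if len(queries) != len(documents):
        raise ValueError("queries and documents must have the same length")
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive document.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
shuffled_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(queries, aligned_docs)
shuffled_loss = in_batch_contrastive_loss(queries, shuffled_docs)
```

The loss is near zero when each query is closest to its own document and grows when the positives are mismatched, which is the signal that pulls paired texts together in embedding space during training.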

Code & Models

Datasets

argilla-warehouse/personahub-fineweb-edu-4-embeddings
dataset · 64 dl

Taxonomy

Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Text and Document Classification Technologies

Methods: Contrastive Learning