Towards General Text Embeddings with Multi-stage Contrastive Learning
Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang

TL;DR
GTE is a versatile text embedding model trained with multi-stage contrastive learning, achieving superior performance across NLP and code tasks with a modest parameter size by leveraging extensive datasets.
Contribution
The paper introduces GTE, a unified text embedding model trained with multi-stage contrastive learning on diverse datasets, outperforming larger models without task-specific fine-tuning.
Findings
With only 110M parameters, GTE outperforms OpenAI's black-box embedding API.
GTE surpasses text embedding models 10x its size on the Massive Text Embedding Benchmark (MTEB).
The model effectively handles code as text without language-specific fine-tuning.
Abstract
We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of…
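As a rough illustration of the contrastive objective the abstract refers to, the sketch below computes a standard InfoNCE-style loss with in-batch negatives. It is a minimal sketch, not the paper's exact loss: the function names, the temperature value of 0.05, and the toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature is a
    hypothetical value chosen for this example.
    """
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy 2-dimensional "embeddings" (made up, not model output).
queries = [[1.0, 0.1], [0.0, 1.0]]
aligned = [[0.9, 0.0], [0.1, 1.0]]    # each query matches its own document
shuffled = [aligned[1], aligned[0]]   # positives swapped, so pairs mismatch

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each query is closest to its paired document and high when the pairs are mismatched, which is the signal that pulls matching texts together during training.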
Code & Models
- 🤗 Alibaba-NLP/gte-modernbert-base · model · 213k dl · ♡ 195
- 🤗 Alibaba-NLP/gte-base-en-v1.5 · model · 391k dl · ♡ 70
- 🤗 thenlper/gte-base · model · 231k dl · ♡ 131
- 🤗 thenlper/gte-large · model · 1.1M dl · ♡ 300
- 🤗 thenlper/gte-small · model · 780k dl · ♡ 185
- 🤗 jncraton/gte-small-ct2-int8 · model · 18 dl
- 🤗 mjwong/gte-large-mnli-anli · model · 11 dl · ♡ 1
- 🤗 dhairya0907/thenlper-get-large · model · 22 dl
- 🤗 barisaydin/gte-base · model · 19 dl
- 🤗 barisaydin/gte-large · model · 25 dl
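
Embeddings from checkpoints like these are typically compared with cosine similarity to rank candidate texts against a query. The sketch below shows only that ranking step. The vectors are made-up stand-ins rather than output from any listed model, and `rank_documents` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; a real model would produce
# vectors with hundreds of dimensions.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # nearly parallel to the query
    [0.5, 0.5, 0.0],   # partially aligned
]

ranking = rank_documents(query_vec, doc_vecs)
best_index = ranking[0][0]
print([index for index, _ in ranking])
```

In practice the query and document vectors would come from encoding text with one of the models above, and the same ranking step would apply unchanged.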
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Text and Document Classification Technologies
Methods: Contrastive Learning
