Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

TL;DR
E5 is a family of text embedding models trained contrastively with weak supervision. It transfers well to tasks such as retrieval, clustering, and classification, outperforming BM25 on BEIR in the zero-shot setting and obtaining the best MTEB results when fine-tuned.
Contribution
The paper introduces E5, a family of contrastively trained text embedding models that achieves strong results across multiple benchmarks using only weak supervision signals from a curated large-scale text pair dataset (CCPairs).
Findings
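In the zero-shot setting, E5 is the first model to outperform the BM25 baseline on the BEIR retrieval benchmark without using any labeled data.
When fine-tuned, E5 obtains the best results on MTEB, beating existing embedding models with 40x more parameters.
E5 transfers well across tasks that need a single-vector text representation, including retrieval, clustering, and classification.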
Abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
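The abstract says the model is trained "in a contrastive manner" on text pairs but this page does not spell out the objective. A common formulation for this kind of training is an InfoNCE loss with in-batch negatives, and the sketch below illustrates that idea in plain Python. The function names, the toy vectors, and the temperature value are all assumptions chosen for illustration, not details taken from this page.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative sketch).

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch acts as a negative. The temperature default is
    an assumed value, not one stated in the source.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Scaled cosine similarity between query i and every passage.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits))
        # Negative log-probability of the matching passage.
        total += log_denom - logits[i]
    return total / len(q)


# Toy 2-dimensional "embeddings" (hypothetical data).
aligned_q = [[1.0, 0.0], [0.0, 1.0]]
aligned_p = [[0.9, 0.1], [0.1, 0.9]]
shuffled_p = [aligned_p[1], aligned_p[0]]

loss_aligned = info_nce_loss(aligned_q, aligned_p)
loss_shuffled = info_nce_loss(aligned_q, shuffled_p)
```

With correctly paired inputs the loss is close to zero, while swapping the passages so that each query faces the wrong positive makes it much larger. That gap is the signal a contrastive objective uses to pull matching pairs together and push mismatched pairs apart.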
Code & Models
- 🤗 intfloat/e5-base-v2 · model · 1.6M dl · ♡ 154
- 🤗 intfloat/e5-small · model · 105k dl · ♡ 44
- 🤗 intfloat/e5-base · model · 118k dl · ♡ 25
- 🤗 intfloat/e5-large · model · 19k dl · ♡ 80
- 🤗 intfloat/e5-small-unsupervised · model · 285 dl
- 🤗 intfloat/e5-base-unsupervised · model · 434 dl · ♡ 2
- 🤗 intfloat/e5-large-unsupervised · model · 1.7k dl · ♡ 6
- 🤗 radames/e5-large · model · 24 dl · ♡ 1
- 🤗 mjwong/e5-large-mnli · model · 10 dl · ♡ 1
- 🤗 mjwong/e5-large-mnli-anli · model · 3 dl
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
