Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang; Nan Yang; Xiaolong Huang; Binxing Jiao; Linjun Yang; Daxin Jiang; Rangan Majumder; Furu Wei

arXiv:2212.03533 · cs.CL · February 23, 2024 · 114 cites

Open Access · 1 repo · 10 models · 1 dataset

TL;DR

E5 is a family of text embedding models trained contrastively with weak supervision from large-scale text pairs. It performs strongly on retrieval, clustering, and classification: in the zero-shot setting it outperforms BM25 on BEIR, and when fine-tuned it obtains the best results on MTEB.

Contribution

The paper introduces E5, a family of text embedding models contrastively pre-trained on CCPairs, a curated large-scale dataset of weakly supervised text pairs, and shows strong results on the BEIR and MTEB benchmarks in both zero-shot and fine-tuned settings.

Findings

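01. E5 is the first model to outperform the BM25 baseline on the BEIR retrieval benchmark in the zero-shot setting, without using any labeled data.

02. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

03. E5 transfers across retrieval, clustering, and classification, as evaluated on 56 datasets from BEIR and MTEB.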

Abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
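
The abstract says E5 is trained contrastively on text pairs but does not spell out the objective. As a rough illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives, a common choice for this kind of training; the function names, the toy 2-dimensional vectors, and the temperature value are all hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, passage) embedding pairs.

    passage_vecs[i] is treated as the positive for query_vecs[i]; every
    other passage in the batch serves as a negative. The temperature is
    an illustrative value, not one reported in the source.
    """
    assert len(query_vecs) == len(passage_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive passage.
        total += log_denom - logits[i]
    return total / len(query_vecs)


# Toy embeddings standing in for encoder outputs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each passage matches its query
shuffled = [aligned[1], aligned[0]]  # positives swapped, so pairs mismatch

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
print(loss_aligned, loss_shuffled)
```

With correctly paired inputs the loss is close to zero, while swapping the positives makes it much larger. Under this assumed objective, that gap is the signal pushing paired texts toward nearby single-vector representations.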

Code & Models

Repositories

microsoft/unilm (PyTorch, official)

Models

Datasets

ai-forever/solyanka (dataset, 570 downloads)

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques