Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr

TL;DR
This paper demonstrates that contrastive pre-training on large-scale unsupervised data produces high-quality text and code embeddings, outperforming previous models in classification, semantic search, and code search tasks.
Contribution
The authors introduce a contrastive pre-training approach that yields state-of-the-art unsupervised text and code embeddings across multiple benchmarks.
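As a rough illustration of what a contrastive objective looks like, the sketch below computes a symmetric cross-entropy loss over cosine similarities, treating the other examples in a batch as negatives. This summary does not spell out the paper's exact objective, so the in-batch-negative formulation, the `temperature` value, and the function names here are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(xs, ys, temperature=0.1):
    """Symmetric cross-entropy over in-batch similarities (assumed formulation).

    xs[i] and ys[i] form a positive pair; every ys[j] with j != i is
    treated as a negative for xs[i], and vice versa.
    """
    xs = [l2_normalize(x) for x in xs]
    ys = [l2_normalize(y) for y in ys]
    n = len(xs)
    # logits[i][j] = cosine similarity of xs[i] and ys[j], scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(x, y)) / temperature for y in ys]
        for x in xs
    ]
    # Row direction: for each x, the matching y should score highest.
    loss_rows = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Column direction: for each y, the matching x should score highest.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_cols = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_rows + loss_cols) / 2
```

With toy two-dimensional vectors, a batch whose pairs line up (`xs == ys`) yields a loss near zero, while swapping the order of `ys` yields a much larger loss, which is the behavior a contrastive objective is meant to reward.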
Findings
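Achieves relative improvements of 4% and 1.8% in linear-probe classification accuracy (averaged over 7 tasks) over the previous best unsupervised and supervised text embedding models, respectively.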
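Attains relative improvements of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions, and TriviaQA semantic search benchmarks, respectively.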
Obtains a 20.8% relative improvement over prior best work on code search.
Abstract
Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings when evaluated on…
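To make the semantic search use case concrete, here is a minimal sketch of ranking documents by cosine similarity to a query embedding. The vectors are hypothetical toy values standing in for model outputs, and `rank_documents` is an illustrative helper name, not part of any released API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return [i for _, i in ranked]


# Hypothetical embeddings: document 1 points in nearly the same direction as the query.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(rank_documents(query, docs))
```

In a real system the query and documents would be embedded by the same model, and an approximate nearest-neighbor index would typically replace the exhaustive scan once the corpus grows large.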
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
