On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Meishan Zhang; Xin Zhang; Xinping Zhao; Shouzheng Huang; Baotian Hu; Min Zhang

arXiv:2507.20783·cs.CL·November 27, 2025

TL;DR

This survey reviews the role of pretrained language models in developing general-purpose text embeddings, highlighting architectures, strategies, advanced capabilities, and future research directions in NLP applications.

Contribution

The survey gives a comprehensive overview of how pretrained language models (PLMs) shape the development of general-purpose text embeddings (GPTE), covering both their fundamental and advanced roles, and discusses avenues for future research.

Findings

01. PLMs are central to deriving rich, transferable text embeddings.

02. Advanced roles include multilingual, multimodal, and code understanding capabilities.

03. Future directions involve bias mitigation, safety, and cognitive extensions.

Abstract

Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training…
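
The pipeline described in the abstract (derive a dense vector from PLM outputs, then optimize it with contrastive learning on paired data) can be illustrated with a minimal sketch. The abstract does not say which pooling or loss is used, so the sketch assumes mean pooling over token vectors and an InfoNCE loss with in-batch negatives. The function names, the toy vectors, and the temperature of 0.05 are illustrative assumptions, not taken from the survey.

```python
import math


def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumption: token_vectors stands in for the per-token hidden states
    a PLM would produce; no real model is called here.
    """
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce_loss(queries, positives, temperature=0.05):
    """InfoNCE with in-batch negatives (assumed loss, hypothetical helper).

    positives[i] is the matching text for queries[i]; every other
    positives[j] in the batch is treated as a negative.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)
```

A possible usage, with hand-written toy vectors in place of real PLM outputs:

```python
query_embedding = mean_pool([[1.0, 0.0], [0.8, 0.2], [9.0, 9.0]], mask=[1, 1, 0])
loss = info_nce_loss([query_embedding, [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
```

The loss is small when each query is closest to its own positive and grows when another item in the batch is more similar, which is the behavior contrastive training relies on.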
