On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang

TL;DR
This survey reviews how pretrained language models (PLMs) shape general-purpose text embeddings (GPTE), covering architectures, strategies, advanced capabilities, and future research directions.
Contribution
It provides a comprehensive overview of how PLMs influence GPTE development, including fundamental and advanced roles, and discusses future research avenues.
Findings
PLMs are central to deriving rich, transferable text embeddings.
Advanced roles include multilingual, multimodal, and code understanding capabilities.
Future directions involve bias mitigation, safety, and cognitive extensions.
Abstract
Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training…
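The pipeline the abstract describes (a PLM produces dense text representations, which are then optimized with contrastive learning on paired data) can be illustrated with a minimal sketch. The abstract does not name a pooling method or a loss, so everything here is an assumption: mean pooling over token states, an InfoNCE-style loss with in-batch negatives, and a temperature of 0.05. The names `mean_pool` and `info_nce_loss` are hypothetical, and plain NumPy arrays stand in for the PLM's output.

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average token vectors over non-padding positions (assumed pooling choice).

    token_states:   (batch, seq_len, dim) hidden states from a PLM
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (assumed objective).

    Row i of query_emb is paired with row i of doc_emb; every other row
    in the batch serves as a negative for that query.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                      # cosine similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # positives on the diagonal

# Toy usage: random arrays in place of real PLM outputs.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 6, 8))          # 4 texts, 6 tokens, 8 dimensions
mask = np.ones((4, 6), dtype=int)
mask[:, 4:] = 0                              # last two tokens are padding
queries = mean_pool(states, mask)
docs = queries + 0.1 * rng.normal(size=queries.shape)  # noisy "paired" texts
loss = info_nce_loss(queries, docs)
```

The loss falls as each query moves closer to its own pair than to the other texts in the batch, which is the sense in which contrastive training makes the embeddings transferable across retrieval-style tasks.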
