Language Models are Universal Embedders

Xin Zhang; Zehan Li; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Meishan Zhang; Min Zhang

arXiv:2310.08232 · cs.CL · May 23, 2025 · 2 cites

TL;DR

This paper shows that pre-trained multilingual decoder-only large language models, once trained as embedders, can serve as effective universal embedders across languages and tasks, including languages and tasks for which no fine-tuning or pre-training data is available.

Contribution

It introduces simple methods to create universal embedding models from multilingual LLMs and provides a benchmark to evaluate their performance across diverse scenarios.

Findings

01

Models produce high-quality embeddings across languages

02

A single trained model is effective across multiple tasks

03

Supports languages and tasks with no finetuning data

Abstract

In the large language model (LLM) revolution, embedding is a key component of various systems, such as retrieving knowledge or memories for LLMs or building content moderation filters. As such cases span from English to other natural or programming languages, from retrieval to classification and beyond, it is advantageous to build a unified embedding model rather than dedicated ones for each scenario. In this context, the pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model is proficient at generating good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also…
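The abstract does not spell out its "straightforward strategies" for constructing embedders. As a rough illustration only, the sketch below assumes one common recipe for decoder-only models: take the hidden state of the last non-padding token as the text embedding, and train with a contrastive (InfoNCE-style) loss. Both choices, the function names, and the temperature value are assumptions made for this example, not details confirmed by the text above. Plain Python lists stand in for model outputs so the sketch runs without any model download.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors (stand-in for a model's final layer).
    attention_mask: 1 for real tokens, 0 for padding.
    Assumes at least one real token is present.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query: low when the positive outranks the negatives.

    The temperature of 0.05 is an illustrative value, not one taken from the paper.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy usage: three token states, the last one is padding.
states = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(last_token_pool(states, mask))
```

In a real setup the token states would come from the language model's final layer, and the loss would be averaged over a batch of query/positive pairs with in-batch or mined negatives.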

Code & Models

Repositories

izhx/uni-rep (official)

Datasets

izhx/google-code-jam
dataset · 15 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification