PL-MTEB: Polish Massive Text Embedding Benchmark

Rafał Poświata; Sławomir Dadas; Michał Perełkiewicz

arXiv:2405.10138·cs.CL·April 27, 2026

TL;DR

PL-MTEB is a comprehensive benchmark for evaluating Polish language text embeddings across diverse NLP tasks, including new datasets and extensive model analysis.

Contribution

Introduces the first large-scale Polish text embedding benchmark with new datasets, evaluation code, and analysis of multiple models.

Findings

01. Evaluated 30 text embedding models on Polish NLP tasks.

02. Provided detailed analysis of model performance by task type and size.

03. Made datasets and evaluation tools publicly available.

Abstract

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic textual similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results publicly available at https://github.com/rafalposwiata/pl-mteb.

Code & Models

Repositories

rafalposwiata/pl-mteb (GitHub): https://github.com/rafalposwiata/pl-mteb
