VN-MTEB: Vietnamese Massive Text Embedding Benchmark

Loc Pham; Tung Luu; Thu Vo; Minh Nguyen; Viet Hoang

arXiv:2507.21500 · cs.CL · July 30, 2025

TL;DR

This paper introduces VN-MTEB, a large-scale Vietnamese text embedding benchmark created by translating and filtering existing datasets, enabling better evaluation of embedding models for Vietnamese language applications.

Contribution

The paper presents the first comprehensive Vietnamese text embedding benchmark with 41 datasets, developed through an automated translation and filtering process for high-quality evaluation.

Findings

1. Larger models with Rotary Positional Embedding outperform smaller ones.

2. The benchmark covers six tasks for Vietnamese text embeddings.

3. High-quality datasets facilitate better model evaluation.

Abstract

Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content moderation tasks in applications is crucial. However, a lack of large-scale test datasets, in both volume and task diversity, makes it difficult for researchers to effectively evaluate AI models before deploying them in real-world, large-scale projects. To address this problem, we introduce VN-MTEB, a Vietnamese benchmark for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while…
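
The abstract describes filtering translated samples with embedding models but does not give the criteria here. The following is a minimal sketch of one common way such a filter could work, assuming each English sample and its Vietnamese translation have already been embedded by some multilingual model. The names `TranslatedSample` and `filter_translations` and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import List, NamedTuple, Sequence


class TranslatedSample(NamedTuple):
    """One translated example plus precomputed embeddings (hypothetical layout)."""
    source_text: str
    translated_text: str
    source_vec: Sequence[float]      # embedding of the English source
    translated_vec: Sequence[float]  # embedding of the Vietnamese translation


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translations(
    samples: List[TranslatedSample], threshold: float = 0.8
) -> List[TranslatedSample]:
    """Keep samples whose source and translation embeddings are close.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    return [
        s for s in samples
        if cosine_similarity(s.source_vec, s.translated_vec) >= threshold
    ]
```

In this sketch a translation whose embedding drifts far from its source is treated as a likely mistranslation and dropped. A real pipeline would obtain the vectors from an actual multilingual embedding model and would probably tune the threshold per dataset.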
