VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang

TL;DR
This paper introduces VN-MTEB, a large-scale Vietnamese text embedding benchmark created by translating and filtering existing datasets, enabling better evaluation of embedding models for Vietnamese language applications.
Contribution
The paper presents the first comprehensive Vietnamese text embedding benchmark with 41 datasets, developed through an automated translation and filtering process for high-quality evaluation.
Findings
Larger models with Rotary Positional Embedding outperform smaller ones.
The benchmark covers six tasks for Vietnamese text embeddings.
High-quality datasets facilitate better model evaluation.
Abstract
Vietnam ranks among the top countries in both internet traffic and online toxicity. As a result, deploying embedding models for recommendation and content control in applications is crucial. However, the lack of large-scale test datasets, in both volume and task diversity, makes it difficult for researchers to evaluate AI models effectively before deploying them in real-world, large-scale projects. To address this problem, we introduce VN-MTEB, a Vietnamese benchmark for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes that retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while…
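The abstract is cut off before it describes the filtering step, so the following is only a minimal sketch of one common way to check semantic fidelity after translation, not the authors' actual pipeline. It assumes each sample already has an embedding for the English source and for the Vietnamese translation, produced by some multilingual embedding model that maps both languages into a shared space. The function names, the sample layout, and the 0.8 threshold are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (translated_text, source_embedding, translation_embedding).
Sample = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translations(samples: List[Sample], threshold: float = 0.8) -> List[str]:
    """Keep translations whose embedding stays close to the source embedding.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for text, source_vec, translation_vec in samples:
        if cosine_similarity(source_vec, translation_vec) >= threshold:
            kept.append(text)
    return kept


# Toy vectors standing in for real model embeddings.
toy_samples = [
    ("bản dịch sát nghĩa", [1.0, 0.0], [0.9, 0.1]),   # close to the source
    ("bản dịch lệch nghĩa", [1.0, 0.0], [0.0, 1.0]),  # orthogonal to the source
]
kept_texts = filter_translations(toy_samples)
```

In practice a similarity check like this would catch translations that drift in meaning, while fluency ("natural flow of language") would need a separate judgment, for example from an LLM, which this sketch does not attempt.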
