Less is More: Adapting Text Embeddings for Low-Resource Languages with Small Scale Noisy Synthetic Data

Zaruhi Navasardyan; Spartak Bughdaryan; Bagrat Minasyan; Hrant Davtyan

arXiv:2603.22290·cs.CL·March 25, 2026

TL;DR

This paper demonstrates that for low-resource languages, small-scale noisy synthetic data can effectively adapt text embeddings, achieving high performance with minimal data and challenging the need for large, pristine datasets.

Contribution

Introduces a cost-effective method for adapting text embeddings to low-resource languages by fine-tuning on a small amount of noisy synthetic data, with results comparable to those of traditional large-scale approaches.

Findings

01. 10,000 noisy synthetic pairs improve retrieval by 20%+

02. Performance matches models trained on 1 million examples

03. Increasing data quality or diversity yields limited gains

Abstract

Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale, noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average…
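The abstract describes fine-tuning an encoder on translated title-body pairs but, as excerpted here, does not state the training objective. As an illustration only, the sketch below shows one common choice for paired data: a contrastive loss with in-batch negatives, where each title is pulled toward its own body and pushed away from the other bodies in the batch. The function names, the temperature value of 0.05, and the use of plain Python lists in place of encoder outputs are all assumptions made for this sketch, not details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_contrastive_loss(title_embs, body_embs, temperature=0.05):
    """Mean cross-entropy of matching title i to body i within the batch.

    title_embs[i] and body_embs[i] are assumed to embed the same post;
    every other body in the batch serves as a negative for title i.
    The temperature default is a hypothetical value for illustration.
    """
    titles = [_normalize(v) for v in title_embs]
    bodies = [_normalize(v) for v in body_embs]
    total = 0.0
    for i, title in enumerate(titles):
        # Cosine similarity of this title to every body, scaled by temperature.
        logits = [sum(a * b for a, b in zip(title, body)) / temperature
                  for body in bodies]
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability assigned to the correct body.
        total += log_denominator - logits[i]
    return total / len(titles)


# Toy batch: three mutually orthogonal "embeddings".
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(batch, batch)
# Rotating the bodies by one position pairs every title with a wrong body.
mismatched_loss = in_batch_contrastive_loss(batch, batch[1:] + batch[:1])
```

On this toy batch the correctly paired inputs give a loss near zero, while the rotated pairing gives a much larger loss; that gap is the signal such an objective would use to align titles with their bodies. A real training run would compute the embeddings with the encoder and backpropagate through a tensor library rather than operate on lists.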

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods