Bootstrapping Embeddings for Low Resource Languages
Merve Basoz, Andrew Horne, Mattia Opper

TL;DR
This paper explores using large language models to generate synthetic triplet data for training embedding models in low-resource languages. In-context learning still falls short of strong non-synthetic baselines, while adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages.
Contribution
The paper introduces two novel strategies for generating the synthetic triplet data used to optimise embedding models in low-resource languages: adapter composition and XL-LoRA (cross-lingual finetuning of the LLM generator).
Findings
Adapter composition and XL-LoRA outperform in-context learning.
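Both novel approaches yield strong performance gains across a wide array of tasks and languages, whereas in-context learning still falls short of strong non-synthetic baselines.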
Methods provide scalable solutions for low-resource language NLP.
Abstract
Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear,…
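The abstract mentions synthetic triplet data without stating, in this summary, which objective is used to train on it. As a rough illustration only, the sketch below shows one common choice, a margin-based triplet loss over cosine distance. The function names, the margin value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss on a single (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. The margin of 0.5 is an assumed value.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-d embeddings standing in for encoder outputs.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]   # same direction as the anchor
negative = [0.0, 1.0]   # orthogonal to the anchor

well_separated = triplet_margin_loss(anchor, positive, negative)
violating = triplet_margin_loss(anchor, negative, positive)  # roles swapped
```

With the roles swapped, the loss becomes positive, which is the signal that would push an encoder to pull the positive closer and the negative farther away. In practice the vectors would come from the embedding model being trained, and the loss would be averaged over a batch of generated triplets.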
Code & Models
- 🤗 mbasoz/lora-gemma327b-xllora-neg · model · 15 dl
- 🤗 mbasoz/lora-gemma327b-xllora-pos · model · 14 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-afr · model · 13 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-hau · model · 22 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-hin · model · 12 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-ind · model · 12 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-kor · model · 22 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-mar · model · 12 dl
- 🤗 mbasoz/sentence-embeddings-xllora-mmbert-tel · model · 13 dl
- 🤗 mbasoz/sentence-embeddings-xllora-xlmr-afr · model · 12 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
