Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa

TL;DR
This study systematically investigates how to adapt large language models to a low-resource language, Basque, when no large instruction dataset exists for it. It finds that target-language corpora, synthetic instructions, and an instruction-tuned backbone together yield robust models.
Contribution
The paper presents a comprehensive set of experiments on instruction tuning for a low-resource language, highlights the importance of target-language corpora and synthetic instructions, and shows that instruction-tuned backbones outperform base (non-instructed) ones.
Findings
Target language corpora are essential for effective instruction tuning.
Synthetic instructions provide robustness in low-resource scenarios.
Instruction-tuned backbones outperform base (non-instructed) backbones.
Abstract
Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an…
Code & Models
- HiTZ/Magpie-Llama-3.1-8B-Instruct-Unfiltered (dataset, 30 downloads)
- HiTZ/Magpie-Llama-3.1-8B-Instruct-Filtered (dataset, 47 downloads)
- HiTZ/Magpie-Llama-3.1-70B-Instruct-Unfiltered (dataset, 33 downloads)
- HiTZ/Magpie-Llama-3.1-70B-Instruct-Filtered (dataset, 70 downloads)
- HiTZ/Magpie-Llama-3.1-8B-Instruct-Filtered-translated-1M (dataset, 20 downloads)
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · ICT in Developing Communities
