Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque
Ander Corral, Ixak Sarasua, Xabier Saralegi

TL;DR
This paper analyzes strategies for developing instruction-following large language models in Basque, a low-resource language, demonstrating significant improvements through targeted pre-training, instruction tuning, and human preference alignment.
Contribution
It provides a comprehensive pipeline for enhancing LLM performance in low-resource languages, with a novel focus on Basque and a detailed evaluation of each development stage.
Findings
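Continual pre-training on a high-quality Basque corpus of around 600 million words improves the foundational model's natural language understanding (NLU) by over 12 points.
Instruction tuning and human preference alignment with automatically translated datasets yield a 24-point improvement in instruction-following performance.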
Proposed models set new state-of-the-art for Basque in the sub-10B parameter range.
Abstract
Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences. Our findings demonstrate that continual pre-training with a high-quality Basque corpus of around 600 million words improves natural language understanding (NLU) of the foundational model by over 12 points. Moreover, instruction tuning and human preference alignment using automatically translated datasets proved highly effective, resulting in a 24-point improvement in instruction-following performance. The resulting models, Llama-eus-8B and…
Taxonomy
Topics: Open Education and E-Learning · Natural Language Processing Techniques
