LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot

TL;DR
LuxIT is a new Luxembourgish instruction tuning dataset synthesized from monolingual texts. Fine-tuning on it improves the performance of smaller LLMs on language exams and NLP tasks.
Contribution
The paper introduces LuxIT, a high-quality monolingual instruction dataset for Luxembourgish, and demonstrates its effectiveness in enhancing LLM performance in low-resource settings.
Findings
Training on LuxIT raises mean accuracy on Luxembourgish language exams by 5.37 percentage points.
Most fine-tuned models also show improved macro-averaged F1 scores on the downstream NLP tasks.
Synthetic monolingual data can effectively boost LLM capabilities in low-resource languages.
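The two metrics named in the findings can be made concrete with a small sketch. Macro-averaged F1 is the unweighted mean of per-class F1 scores, and a change in percentage points is the plain difference between two accuracies expressed in percent. The function names, model names, and numbers below are illustrative assumptions, not values or code from the paper.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_pp_change(base_acc, tuned_acc):
    """Mean accuracy change in percentage points; accuracies are fractions in [0, 1]."""
    deltas = [100.0 * (tuned_acc[name] - base_acc[name]) for name in base_acc]
    return sum(deltas) / len(deltas)


# Hypothetical accuracies for two made-up models (not results from the paper).
base = {"model_a": 0.50, "model_b": 0.60}
tuned = {"model_a": 0.55, "model_b": 0.70}
print(round(mean_pp_change(base, tuned), 2))
print(round(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), 4))
```

Averaging per-model differences, rather than differencing pooled accuracies, gives each model equal weight regardless of how many exam items it answered.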
Abstract
The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs (≤15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on language…
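The quality assurance step described in the abstract, where a judge model scores generated pairs and only high-scoring ones are retained, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `Pair` type, the `filter_pairs` helper, the score scale, and the threshold are all hypothetical, and `toy_judge` is a trivial stand-in for a real LLM call. None of it is the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pair:
    instruction: str
    answer: str


def filter_pairs(
    pairs: Iterable[Pair],
    judge: Callable[[Pair], float],
    min_score: float,
) -> List[Pair]:
    """Keep only the pairs whose judge score reaches min_score."""
    return [pair for pair in pairs if judge(pair) >= min_score]


def toy_judge(pair: Pair) -> float:
    """Placeholder judge (assumption): a real setup would prompt an LLM
    to rate the pair; here empty answers score 0.0 and all others 1.0."""
    return 1.0 if pair.answer.strip() else 0.0


candidates = [
    Pair("Wat ass d'Haaptstad vu Lëtzebuerg?", "D'Haaptstad ass d'Stad Lëtzebuerg."),
    Pair("Erklär d'Wuert 'Moien'.", ""),
]
kept = filter_pairs(candidates, toy_judge, min_score=0.5)
print(len(kept), "of", len(candidates), "pairs retained")
```

Passing the judge in as a callable keeps the filtering logic independent of whichever model or scoring rubric is used.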
