LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Julian Valline; Cedric Lothritz; Siwen Guo; Jordi Cabot

arXiv:2510.24434 · cs.CL · March 31, 2026

TL;DR

LuxIT is a new Luxembourgish instruction tuning dataset synthesized from monolingual texts. Fine-tuning smaller LLMs on it raises their average accuracy on Luxembourgish language exams and improves results on NLP tasks for most models.

Contribution

The paper introduces LuxIT, a high-quality monolingual instruction dataset for Luxembourgish, and demonstrates its effectiveness in enhancing LLM performance in low-resource settings.

Findings

01

Training on LuxIT yields a mean accuracy change of +5.37 percentage points on Luxembourgish language exams.

02

Most of the fine-tuned models show improved macro-averaged F1 scores on the downstream NLP tasks.

03

Synthetic monolingual data can effectively boost LLM capabilities in low-resource languages.
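
The two metrics named in the findings above can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names `macro_f1` and `mean_pp_change` are hypothetical, the inputs in the usage note are made-up toy values, and the label set is assumed to be the union of gold and predicted labels.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight.

    Assumption: classes are the union of labels seen in y_true and y_pred.
    A class with no true or predicted instances contributes an F1 of 0.0.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def mean_pp_change(acc_before, acc_after):
    """Mean accuracy change across models, in percentage points.

    Both inputs are per-model accuracies in [0, 1], in the same model order.
    """
    if len(acc_before) != len(acc_after) or not acc_before:
        raise ValueError("inputs must be non-empty and equally long")
    diffs = [after - before for before, after in zip(acc_before, acc_after)]
    return 100.0 * sum(diffs) / len(diffs)
```

For instance, with toy accuracies of 0.50 and 0.60 before fine-tuning and 0.55 and 0.70 after, `mean_pp_change([0.50, 0.60], [0.55, 0.70])` averages the per-model differences of 5 and 10 points. A percentage-point change is an absolute difference between two accuracies, not a relative improvement.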

Abstract

The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs ($\leq$ 15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on language…
