InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo

TL;DR
InstructLR is a new framework that combines large language models and human validation to create high-quality instruction datasets for low-resource languages, enabling better language support.
Contribution
It introduces a scalable, dual-layer filtering approach for generating instruction datasets in under-resourced languages, filling a critical gap in language model training resources.
Findings
Created three multi-domain instruction benchmarks for LRLs
Achieved high-quality, fluent instruction datasets with human validation
Demonstrated effectiveness of LLM-driven data generation with filtering
Abstract
State-of-the-art large language models (LLMs) still struggle to support effective text generation and chat interfaces for low-resource languages (LRLs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer that uses retrieval-augmented generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from…
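The dual-layer filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Sample`, `retrieve_examples`, `build_prompt`, and `dual_layer_filter` are hypothetical, retrieval is approximated here by simple token overlap, and the LLM corrector and the human reviewer are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    instruction: str
    response: str


def tokenize(text: str) -> set:
    # Whitespace tokenization; an assumption made only for illustration.
    return set(text.lower().split())


def retrieve_examples(draft: str, clean_corpus: List[str], n: int = 3) -> List[str]:
    """Return the n clean sentences with the highest Jaccard token overlap with the draft."""
    draft_tokens = tokenize(draft)

    def overlap(sentence: str) -> float:
        tokens = tokenize(sentence)
        union = draft_tokens | tokens
        return len(draft_tokens & tokens) / len(union) if union else 0.0

    return sorted(clean_corpus, key=overlap, reverse=True)[:n]


def build_prompt(draft: str, examples: List[str]) -> str:
    """Assemble an n-shot prompt: retrieved reference sentences, then the draft."""
    shots = "\n".join(f"- {e}" for e in examples)
    return (
        "Reference sentences in the target language:\n"
        f"{shots}\n"
        "Correct the fluency and orthography of this draft:\n"
        f"{draft}"
    )


def dual_layer_filter(
    drafts: List[Sample],
    clean_corpus: List[str],
    corrector: Callable[[str], str],        # hypothetical stand-in for an LLM call
    human_accepts: Callable[[Sample], bool],  # hypothetical stand-in for a reviewer
    n: int = 3,
) -> List[Sample]:
    """Layer 1: retrieval-grounded automatic correction. Layer 2: human validation."""
    accepted = []
    for draft in drafts:
        examples = retrieve_examples(draft.response, clean_corpus, n)
        corrected = corrector(build_prompt(draft.response, examples))
        candidate = Sample(draft.instruction, corrected)
        if human_accepts(candidate):
            accepted.append(candidate)
    return accepted
```

In practice `corrector` would wrap a call to whichever LLM is used and `human_accepts` would record an annotator's decision; both are left abstract here because the summary does not specify them.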
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- **Timely and important topic:** Addressing LLM accessibility for under-resourced languages is a highly relevant problem with social and scientific impact.
- **Complete and scalable approach:** The paper presents an end-to-end framework, from seed instruction generation to human validation, which is reusable across languages and domains.
- **Clarity and reproducibility:** The pipeline is clearly described and supported by well-chosen examples and figures. The authors also emphasize cost-eff
- **Limited novelty:** While the framework integrates translation, RAG-based filtering, and human validation effectively, these components are individually standard. The main contribution is the *composition* of these techniques rather than a new algorithmic insight.
- **Experimental focus:** The experiments primarily show that models fine-tuned on the resulting datasets perform better than baselines. This is a known fact. However, they do not deeply analyze *the pipeline itself*, for instance, h
1. Tackles multilingual equity by addressing a pressing issue: lack of instruction datasets for African and other under-resourced languages.
2. The dual-layer filtering pipeline (RAG-based automatic correction + human validation) is novel and pragmatic.
3. Framework demonstrated across three distinct LRLs, showing language-agnostic and reusable properties.
4. Quantitative gains (BLEU +20–30, ROUGE, METEOR) and human preference results clearly substantiate claims.
1. Relies on Gemini and GPT-4o for initial generation; this undermines reproducibility and scalability in low-resource contexts.
2. All three LRLs are French-contact African languages, so claims of language-agnosticism remain under-tested.
3. Only five Zarma annotators and one Bambara annotator, too few to ensure dialectal or sociolinguistic representativeness.
4. The dual-layer filtering ensures fluency but not factual correctness, leaving potential hallucination propagation unaddressed.
5. The framework
- This paper investigates an important problem: how to create a large number of high-quality instruction-following samples in low-resource languages.
- The instruction-following datasets generated in three low-resource languages will be helpful to the low-resource NLP community.
- **Limited Generalization**: This pipeline requires a large language model with reasonable performance on the low-resource language and human experts for evaluation and correction, which makes it hard to scale and generalize to some low-resource languages. For a given low-resource language, the pipeline may not be applicable if all large language models perform poorly or no suitable human experts are available. On the other hand, the number of instruction-following samples is constrained by the budget t
Taxonomy
Topics: ICT in Developing Communities · Natural Language Processing Techniques · Speech Recognition and Synthesis
