BYOL: Bring Your Own Language Into LLMs
Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor, Luana Marotti, Inbal Becker-Reshef, Juan Lavista Ferres

TL;DR
BYOL is a scalable framework for developing language-aware large language models matched to each language's digital presence. It improves performance for low-resource languages through tier-specific pipelines and translation-mediated methods.
Contribution
The paper presents a novel unified framework that classifies languages by resource level and applies tailored data and model strategies, including translation methods, to improve LLM performance for low-resource languages.
Findings
Language-specific LLMs improve by 12% over multilingual baselines.
Translation-mediated inclusion improves the Inuktitut translation BLEU score by 4 points.
Public release of translated benchmarks and models supports further research.
Abstract
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway.…
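The classify-then-route idea in the abstract can be illustrated with a minimal sketch. Only the four tier names and the phrase "translation-mediated inclusion" come from the text above. The token-count thresholds, the `classify_language` and `select_pathway` helpers, and the tier-to-pathway mapping are all hypothetical placeholders chosen for illustration, not the authors' actual criteria.

```python
from enum import Enum


class Tier(Enum):
    EXTREME_LOW = "Extreme-Low"
    LOW = "Low"
    MID = "Mid"
    HIGH = "High"


# ASSUMPTION: upper bounds on corpus size (in tokens) for each tier.
# These numbers are invented for illustration; the summary does not state them.
HYPOTHETICAL_THRESHOLDS = [
    (1_000_000, Tier.EXTREME_LOW),
    (100_000_000, Tier.LOW),
    (10_000_000_000, Tier.MID),
]

# ASSUMPTION: which pathway each tier is routed to. Only the name
# "translation-mediated inclusion" appears in the summary; the rest are placeholders.
HYPOTHETICAL_PATHWAYS = {
    Tier.EXTREME_LOW: "translation-mediated inclusion",
    Tier.LOW: "language-specific data and model pipeline",
    Tier.MID: "placeholder pathway for mid-resource languages",
    Tier.HIGH: "placeholder pathway for high-resource languages",
}


def classify_language(corpus_tokens: int) -> Tier:
    """Map a corpus size to a resource tier using the hypothetical thresholds."""
    if corpus_tokens < 0:
        raise ValueError("corpus_tokens must be non-negative")
    for upper_bound, tier in HYPOTHETICAL_THRESHOLDS:
        if corpus_tokens < upper_bound:
            return tier
    return Tier.HIGH


def select_pathway(corpus_tokens: int) -> str:
    """Classify a language, then look up its (hypothetical) integration pathway."""
    return HYPOTHETICAL_PATHWAYS[classify_language(corpus_tokens)]


if __name__ == "__main__":
    for tokens in (50_000, 20_000_000, 5_000_000_000, 500_000_000_000):
        tier = classify_language(tokens)
        print(f"{tokens:>15,} tokens -> {tier.value}: {select_pathway(tokens)}")
```

Keeping the thresholds and the pathway table as plain data, separate from the lookup logic, means either could be swapped for the paper's real criteria without touching the functions.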
Taxonomy
Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Text Readability and Simplification
