Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim

TL;DR
This study demonstrates that cost-effective continual pre-training with parameter-efficient fine-tuning can adapt large language models to low-resource dialects like Quebec French, improving minority dialect benchmarks with minimal impact on high-resource language performance.
Contribution
The paper introduces a practical approach that combines low-rank adaptation (LoRA) with continual pre-training (CPT) to adapt LLMs to low-resource dialects, reports results on the COLE benchmark suite, and releases Québec French LLMs for wider access.
Findings
CPT improves minority dialect benchmark scores while updating only around 1% of model parameters.
Corpus composition significantly affects adaptation success.
Releasing Quebec French LLMs enhances accessibility for minority language communities.
Abstract
Despite the widespread adoption of Large Language Models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks and minimal regression on the prestige language benchmarks, with around 1% of model parameters updated. Analysis of the results demonstrates that gains are highly contingent on corpus composition.…
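To make the "around 1% of model parameters updated" figure concrete, here is a minimal sketch of how LoRA's trainable-parameter count is usually computed. LoRA freezes each adapted weight matrix W (d_out x d_in) and trains only two low-rank factors, B (d_out x r) and A (r x d_in), so each adapted matrix adds r * (d_in + d_out) trainable parameters. Every number below (hidden size, rank, layer count, base model size) is a hypothetical illustration and not the configuration used in the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Count trainable parameters added by LoRA adapters.

    Each adapted matrix gets A (rank x d_in) and B (d_out x rank);
    the frozen base weight W is not counted.
    """
    return rank * (d_in + d_out) * n_matrices


# Hypothetical setup (assumed for illustration, not taken from the paper):
# a 7B-parameter base model with 32 layers, hidden size 4096, and
# rank-64 adapters on 4 square attention projections per layer.
BASE_PARAMS = 7_000_000_000
N_LAYERS = 32
PROJECTIONS_PER_LAYER = 4
HIDDEN = 4096
RANK = 64

trainable = lora_trainable_params(
    d_in=HIDDEN,
    d_out=HIDDEN,
    rank=RANK,
    n_matrices=N_LAYERS * PROJECTIONS_PER_LAYER,
)
frac = trainable / BASE_PARAMS

print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of base model:    {frac:.2%}")
```

Under these assumed settings the adapters come to just under 1% of the base model's parameters. Which matrices are adapted and at what rank are design choices that move this fraction up or down.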
