LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
Luca Ballore

TL;DR
This paper introduces LLiMba, a Sardinian language model adapted from Qwen2.5-3B-Instruct through continued pretraining and supervised fine-tuning on a single 24 GB consumer GPU, and compares five fine-tuning configurations.
Contribution
It shows how a 3B-parameter language model can be adapted to a low-resource language such as Sardinian on minimal hardware, and compares five supervised fine-tuning configurations under matched conditions.
Findings
rsLoRA r256 achieves the highest BLEU scores among the five fine-tuning configurations compared.
Adapter capacity significantly influences adaptation quality.
Translation metrics may not fully capture qualitative differences.
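To make the adapter-capacity finding concrete, the sketch below contrasts the standard LoRA scaling factor (alpha / r) with the rank-stabilized rsLoRA factor (alpha / sqrt(r)) at the three ranks named in the paper, and counts the parameters a rank-r adapter adds to one weight matrix. The alpha value and the 2048 x 2048 projection size are illustrative assumptions, not settings taken from the paper.

```python
import math

def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA: the low-rank update B @ A is multiplied by alpha / r."""
    return alpha / rank

def rslora_scale(alpha: float, rank: int) -> float:
    """rsLoRA: the update is multiplied by alpha / sqrt(r) instead."""
    return alpha / math.sqrt(rank)

def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return rank * (d_in + d_out)

ALPHA = 16.0   # hypothetical value, chosen only for illustration
D = 2048       # hypothetical square projection width, not from the paper

for rank in (64, 128, 256):
    print(
        f"r={rank:<3d} "
        f"LoRA scale={lora_scale(ALPHA, rank):.4f} "
        f"rsLoRA scale={rslora_scale(ALPHA, rank):.4f} "
        f"params per matrix={adapter_params(D, D, rank):,}"
    )
```

With a fixed alpha, the LoRA factor shrinks linearly as the rank grows, while the rsLoRA factor shrinks only with the square root of the rank, which is the usual motivation for using rsLoRA at high ranks such as 128 or 256.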
Abstract
Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B-parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT, the model reaches a perplexity of 6.76 on held-out Sardinian and outperforms the base model across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian,…
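For readers unfamiliar with the perplexity figure quoted in the abstract, the following minimal sketch shows the standard definition: the exponential of the mean negative log-likelihood per token. It assumes natural-log token probabilities are already available; the function name and the toy input are hypothetical, and the paper's exact evaluation setup (tokenizer, context length) is not stated in this excerpt.

```python
import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy input: a model that assigns probability 0.25 to every token
# is, on average, choosing among 4 equally likely options.
toy_log_probs = [math.log(0.25)] * 4
print(perplexity(toy_log_probs))
```

Under this definition, a perplexity of 6.76 corresponds to a mean loss of ln(6.76), roughly 1.911 nats per token.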
