Building a Strong Instruction Language Model for a Less-Resourced Language
Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec

TL;DR
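This paper introduces GaMS3-12B, a 12-billion-parameter Slovene language model adapted from Gemma 3 through three-stage continual pre-training followed by two-stage supervised fine-tuning (SFT). It is reported as the best-performing open-source model for Slovene within its parameter range and as performing comparably to GPT-4o on Slovene tasks.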
Contribution
The paper presents a set of methodological approaches for adapting large language models to less-resourced languages, demonstrated on Slovene, and reports the best performance among open-source models for Slovene within its parameter range.
Findings
GaMS3-12B outperforms the 12B Gemma 3 model in all evaluation scenarios.
The model performs comparably to GPT-4o on Slovene tasks, with a win rate above 60%.
Effective adaptation techniques enable LLMs to excel in less-resourced languages.
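
The summary above does not say how the reported win rate is computed. As a rough illustration only, the sketch below shows one common convention for turning pairwise comparison outcomes into a win rate. The function name `win_rate`, the outcome labels, and the option to count ties as half a win are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def win_rate(outcomes, model="A", count_ties_as_half=True):
    """Fraction of pairwise comparisons won by `model`.

    `outcomes` is a list of labels, one per comparison: the name of the
    winning model, or "tie". Counting a tie as half a win is an assumed
    convention; the paper may handle ties differently.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons given")
    score = counts[model]
    if count_ties_as_half:
        score += 0.5 * counts["tie"]
    return score / total


# Hypothetical outcomes for ten comparisons between models "A" and "B".
example = ["A"] * 6 + ["B"] * 3 + ["tie"]
print(win_rate(example))
print(win_rate(example, count_ties_as_half=False))
```

Whether ties are dropped, split, or counted as losses can shift the figure by several points, so a reported win rate is easiest to interpret when the tie policy is stated alongside it.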
Abstract
Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
