Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language
Stefan Krsteski, Matea Tashkovska, Borjan Sazdov, Hristijan Gjoreski, Branislav Gerazov

TL;DR
This paper introduces a comprehensive set of resources, including a large Macedonian corpus, instruction dataset, and evaluation benchmarks, to develop and assess a state-of-the-art language model tailored for Macedonian, a low-resource language.
Contribution
The authors built the largest Macedonian corpus to date, a culturally grounded instruction dataset, and a Macedonian evaluation suite. They also trained an 8B-parameter model that outperforms existing 8B models and performs comparably to models up to 10x larger. All resources are openly released.
Findings
The model, domestic-yak, outperforms existing 8B models across the benchmarks.
It achieves performance comparable to models up to 10x larger.
Native speakers prefer its outputs for grammatical correctness and cultural relevance.
Abstract
The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against…
