EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts

TL;DR
This paper demonstrates that continued pretraining and post-training alignment can significantly improve Estonian language capabilities in a multilingual LLM without sacrificing English performance.
Contribution
The paper introduces a method that combines continued pretraining (CPT) with post-training alignment to improve Estonian capabilities in a multilingual LLM while preserving its English and general reasoning performance.
Findings
Significant improvements in Estonian linguistic competence and reasoning.
Enhanced translation quality and instruction-following in Estonian.
Maintained competitive performance on English benchmarks.
Abstract
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following…
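The abstract names chat vector merging as one of the post-training steps but does not spell out the procedure. As the technique is commonly described, a "chat vector" is the parameter-wise difference between an instruction-tuned model and its base model, which is then added to the weights of the continued-pretrained model. The sketch below illustrates that arithmetic on toy weights. The function names, the `scale` parameter, and the dict-of-lists weight representation are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def chat_vector(instruct: Weights, base: Weights) -> Weights:
    """Parameter-wise difference: instruction-tuned weights minus base weights."""
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }


def merge_chat_vector(cpt: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) chat vector to continued-pretrained weights.

    `scale` is a hypothetical knob for illustration; the source does not
    say whether any scaling is applied.
    """
    return {
        name: [c + scale * d for c, d in zip(cpt[name], delta[name])]
        for name in cpt
    }


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
base = {"layer.weight": [1.0, 2.0, 3.0]}
instruct = {"layer.weight": [1.5, 2.0, 2.0]}
cpt = {"layer.weight": [2.0, 2.0, 4.0]}

delta = chat_vector(instruct, base)
merged = merge_chat_vector(cpt, delta)
```

The merge assumes all three models share the same architecture and parameter names, which holds when the instruction-tuned and continued-pretrained models derive from the same base checkpoint.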
Code & Models
- 🤗 tartuNLP/llama-estllm-prototype-0825 (model) · 69 dl · ♡ 3
- 🤗 tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 (model) · 2.7k dl · ♡ 5
- 🤗 tartuNLP/Llama-3.1-EstLLM-8B-0525 (model) · 550 dl
- 🤗 tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 (model) · 34 dl · ♡ 2
- 🤗 tartuNLP/Apertus-EstLLM-8B-Instruct-1125 (model) · 46 dl · ♡ 1
- 🤗 tartuNLP/Apertus-EstLLM-8B-1125 (model) · 444 dl
- 🤗 tartuNLP/Apertus-EstLLM-8B-Instruct-0326 (model) · 317 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
