"Vorbe\c{s}ti Rom\^ane\c{s}te?" A Recipe to Train Powerful Romanian LLMs with English Instructions
Mihai Masala, Denis C. Ilie-Ablachim, Alexandru Dima, Dragos, Corlatescu, Miruna Zavelca, Ovio Olaru, Simina Terian, Andrei Terian, Marius, Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea

TL;DR
This paper presents a comprehensive approach to developing high-performance Romanian language models by collecting, translating, and training on diverse datasets, and releases resources to foster further research in low-resource languages.
Contribution
It introduces the first large-scale Romanian LLMs trained on translated and native data, along with a reproducible recipe applicable to other low-resource languages.
Findings
Achieved state-of-the-art results on Romanian benchmarks.
Demonstrated the effectiveness of translated and native datasets.
Provided open-source models and resources for Romanian NLP.
Abstract
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗OpenLLM-Ro/RoLlama2-7b-Base-2024-05-14model· 9 dl· ♡ 49 dl♡ 4
- 🤗OpenLLM-Ro/RoLlama2-7b-Instruct-2024-05-14model· 31 dl· ♡ 831 dl♡ 8
- 🤗OpenLLM-Ro/RoMistral-7b-Instruct-2024-05-17model· 12 dl· ♡ 412 dl♡ 4
- 🤗OpenLLM-Ro/RoLlama3-8b-Instruct-2024-06-28model· 10 dl· ♡ 910 dl♡ 9
- 🤗OpenLLM-Ro/RoGemma-7b-Instruct-2024-06-28model· 5 dl· ♡ 15 dl♡ 1
- 🤗RichardErkhov/OpenLLM-Ro_-_RoLlama3-8b-Instruct-ggufmodel· 14 dl14 dl
- 🤗RichardErkhov/OpenLLM-Ro_-_RoLlama2-7b-Instruct-ggufmodel· 7 dl7 dl
- 🤗QuantFactory/RoLlama2-7b-Base-GGUFmodel· 59 dl· ♡ 259 dl♡ 2
- 🤗RichardErkhov/OpenLLM-Ro_-_RoMistral-7b-Instruct-ggufmodel· 12 dl12 dl
- 🤗OpenLLM-Ro/RoLlama2-7b-Instruct-2024-10-09model· 3 dl3 dl
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTranslation Studies and Practices · Linguistics, Language Diversity, and Identity · Natural Language Processing Techniques
