LLMic: Romanian Foundation Language Model
Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu

TL;DR
The paper introduces LLMic, a bilingual Romanian language model, demonstrating effective pretraining, fine-tuning, and superior translation performance for a low-resource language using a smaller model.
Contribution
It presents the complete process of developing a Romanian-focused language model, including corpus creation, architecture choice, and hyper-parameter tuning, with improved translation results.
Findings
When specialized for tasks in Romanian, LLMic achieves results comparable to much larger open models.
Fine-tuning enhances translation quality significantly.
The approach benefits low-resource language processing.
Abstract
Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with commercial models leading the way. While open models usually operate at a smaller scale, they maintain competitiveness through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to other, much larger open models. We show…
Taxonomy
Topics: Natural Language Processing Techniques
