LLMic: Romanian Foundation Language Model

Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu

arXiv:2501.07721·cs.CL·January 15, 2025

TL;DR

The paper introduces LLMic, a bilingual foundation language model built for Romanian, a low-resource language. It covers both pretraining and fine-tuning, and reports that the model, despite its smaller size, translates better than existing solutions.

Contribution

The paper documents the complete process of developing a Romanian-focused language model, including corpus construction, architecture selection, and hyper-parameter optimization, and reports improved translation results.

Findings

01. LLMic achieves results comparable to those of much larger open models.

02. Fine-tuning significantly improves translation quality.

03. The approach benefits language processing for low-resource languages.

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with commercial models leading the way. While open models usually operate at a smaller scale, they remain competitive through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to those of other, much larger open models. We show…

Taxonomy

Topics: Natural Language Processing Techniques