Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering
Vlad Negoita, Mihai Masala, Traian Rebedea

TL;DR
This paper analyzes the characteristics of Romanian pretraining data and applies multi-level filtering to improve its quality. Models pretrained on the filtered data perform better on benchmarks, which helps address the scarcity of high-quality data for under-represented languages.
Contribution
The paper introduces a novel multi-level filtering approach for Romanian LLM pretraining data that improves both data quality and model performance over previous datasets.
Findings
Filtering improves LLM performance on benchmarks
Romanian data has distinct topic coverage from English
Multi-level filtering enhances data quality
Abstract
Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across…
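The multi-level filtering described in the abstract can be illustrated with a minimal sketch. It assumes each document already carries three annotations (educational value, topic, format) of the kind the abstract attributes to the multitask model. The 0-5 score scale, the label names, the thresholds, and the names Annotation, keep, and filter_corpus are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Per-document labels; field names and value ranges are assumptions."""
    educational_value: int  # hypothetical 0-5 scale, higher is better
    topic: str              # hypothetical topic label
    fmt: str                # hypothetical format label


def keep(ann, min_edu=3,
         blocked_topics=frozenset({"adult"}),
         blocked_formats=frozenset({"spam"})):
    """Return True only if the document passes every filtering level."""
    return (
        ann.educational_value >= min_edu
        and ann.topic not in blocked_topics
        and ann.fmt not in blocked_formats
    )


def filter_corpus(docs, annotate, **thresholds):
    """Keep the documents whose annotation passes all levels.

    `annotate` is any callable mapping a document to an Annotation;
    in practice it would wrap a trained classifier.
    """
    return [doc for doc in docs if keep(annotate(doc), **thresholds)]


# Toy stand-in for a classifier: a fixed lookup table.
toy_labels = {
    "lesson": Annotation(4, "science", "article"),
    "ad": Annotation(4, "science", "spam"),
    "chatter": Annotation(1, "science", "article"),
}
kept = filter_corpus(list(toy_labels), toy_labels.get)
```

In this toy run only "lesson" survives: "ad" is rejected at the format level and "chatter" at the educational-value level. Combining the levels with a logical AND is one possible design; a weighted score across levels would be another.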
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
