Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering

Vlad Negoita; Mihai Masala; Traian Rebedea

arXiv:2511.01090·cs.CL·November 4, 2025

TL;DR

This paper analyzes the characteristics of Romanian pretraining data and applies multi-level filtering to raise its quality. Models pretrained on the filtered data perform better on benchmarks, which helps address the scarcity of high-quality data for under-represented languages.

Contribution

The paper introduces a multi-level filtering approach for Romanian LLM pretraining data, covering signals such as educational value, topic, and format. The filtered data is of higher quality and yields better model performance than previous datasets.

Findings

01 Filtering improves LLM performance on benchmarks

02 Romanian data has distinct topic coverage from English

03 Multi-level filtering enhances data quality

Abstract

Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across…
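To make the idea of multi-level filtering concrete, here is a minimal Python sketch of how per-document annotations could be combined into a keep/drop decision. It is an illustration only, not the authors' pipeline: the `AnnotatedDoc` fields, the score scale, the label names, and the threshold are all hypothetical, and in practice the annotations would come from a classifier such as the multitask model the abstract describes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class AnnotatedDoc:
    """A document with hypothetical per-level annotations."""
    text: str
    edu_score: float   # assumed educational-value score, higher is better
    topic: str         # assumed topic label
    doc_format: str    # assumed format label


def multi_level_filter(
    docs: Iterable[AnnotatedDoc],
    min_edu_score: float,
    excluded_topics: Set[str],
    excluded_formats: Set[str],
) -> List[AnnotatedDoc]:
    """Keep only documents that pass every filtering level."""
    kept = []
    for doc in docs:
        if doc.edu_score < min_edu_score:
            continue  # level 1: educational value too low
        if doc.topic in excluded_topics:
            continue  # level 2: unwanted topic
        if doc.doc_format in excluded_formats:
            continue  # level 3: unwanted format
        kept.append(doc)
    return kept


# Toy data; all labels and scores are invented for illustration.
sample = [
    AnnotatedDoc("Lesson on photosynthesis", 4.0, "science", "tutorial"),
    AnnotatedDoc("50% off everything today", 1.0, "shopping", "advertisement"),
    AnnotatedDoc("Repeated comment spam", 3.5, "science", "spam"),
]

kept = multi_level_filter(
    sample,
    min_edu_score=3.0,
    excluded_topics={"adult"},
    excluded_formats={"spam", "advertisement"},
)
print([doc.text for doc in kept])
```

Applying the levels as independent checks keeps each criterion easy to tune or disable on its own, which is one plausible way to compare the effect of individual filters.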

Code & Models

Models

OpenLLM-Ro/FineWeb2-RoEdu-Classifier

Videos

Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling