FuLG: 150B Romanian Corpus for Language Model Pretraining

Vlad-Andrei Bădoiu; Mihai-Valentin Dumitru; Alexandru M. Gherghescu; Alexandru Agache; Costin Raiciu

arXiv:2407.13657·cs.CL·July 19, 2024·2 cites

TL;DR

FuLG is a large-scale Romanian corpus of 150 billion tokens, created from CommonCrawl, with a detailed filtering methodology and comparative analysis against existing Romanian datasets.

Contribution

This paper introduces FuLG, a 150-billion-token Romanian corpus for language model pretraining extracted from CommonCrawl. It describes the filtering methodology used to build the corpus and reports ablation studies comparing FuLG against existing Romanian corpora.

Findings

01

FuLG contains 150 billion tokens.

02

The filtering methodology improves corpus quality.

03

In ablation studies, FuLG compares favorably against existing Romanian corpora.

Abstract

Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a hundred-fifty-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.

Code & Models

Datasets

faur-ai/fulg
dataset · 13k downloads

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus