Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models

Sultan Alrashed; Dmitrii Khizbullin; David R. Pugh

arXiv:2411.06402·cs.CL·November 12, 2024

TL;DR

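This paper introduces FineWeb-Edu-Ar, an Arabic machine translation of the deduplicated FineWeb-Edu dataset. The authors describe it as, to their knowledge, the largest publicly available machine-translated Arabic dataset, built to support the development of small Arabic language models from high-quality English source text.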

Contribution

It presents a new large-scale machine-translated Arabic dataset, expanding resources for low-resource language modeling and facilitating the training of small Arabic language models.

Findings

01

Largest publicly available machine-translated Arabic dataset, to the authors' knowledge

02

Contains 202 billion tokens, as counted with an Arabic-trained tokenizer

03

Supports development of small Arabic language models

Abstract

As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality and readily available data online has led to a multitude of synthetic dataset generation approaches. A key technique in this space is machine translation (MT), where high-quality English text is adapted to a target, comparatively low-resource language. This report introduces FineWeb-Edu-Ar, a machine-translated version of the exceedingly popular (deduplicated) FineWeb-Edu dataset from HuggingFace. To the best of our knowledge, FineWeb-Edu-Ar is the largest publicly available machine-translated Arabic dataset, at 202B tokens as measured with an Arabic-trained tokenizer.
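
The translate-a-corpus idea described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the abstract does not name the translation model or any preprocessing, so `translate` is a hypothetical callable supplied by the caller, and the paragraph-packing step with its `max_chars` limit is an assumption motivated by the fact that translation models typically accept bounded input lengths.

```python
from typing import Callable, Iterable, Iterator


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Pack newline-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    Joining the returned chunks with "\n" reproduces the input text.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n"):
        # Account for the "\n" separator when the chunk already has content.
        added = len(para) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [para], len(para)
        else:
            current.append(para)
            size += added
    if current:
        chunks.append("\n".join(current))
    return chunks


def translate_document(
    text: str,
    translate: Callable[[str], str],  # hypothetical: English in, Arabic out
    max_chars: int = 1000,
) -> str:
    """Translate one document chunk by chunk and stitch the result together."""
    return "\n".join(translate(c) for c in chunk_paragraphs(text, max_chars))


def translate_corpus(
    docs: Iterable[str],
    translate: Callable[[str], str],
    max_chars: int = 1000,
) -> Iterator[str]:
    """Lazily translate a stream of documents, one at a time."""
    for doc in docs:
        yield translate_document(doc, translate, max_chars)
```

Because `translate_corpus` is a generator, documents are processed one at a time, which matters for a corpus far too large to hold in memory. Any real translation backend can be plugged in as `translate` without changing the chunking logic.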

Code & Models

Datasets

kaust-generative-ai/fineweb-edu-ar
dataset · 2.0k downloads
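
A dataset of this size is usually inspected by streaming rather than downloading it in full. The sketch below uses the Hugging Face `datasets` library's `load_dataset` with `streaming=True` and the repository ID listed above. The split name `"train"` is an assumption, as is the absence of a required configuration name; the helper `peek_rows` and its `loader` parameter are illustrative names introduced here so the function can be exercised without network access.

```python
from itertools import islice

# Repository ID as listed in the Datasets entry above.
DATASET_ID = "kaust-generative-ai/fineweb-edu-ar"


def peek_rows(dataset_id=DATASET_ID, n=3, split="train", loader=None):
    """Return the first n rows of a split without downloading the whole dataset.

    `split="train"` is an assumption about how the repository is organised.
    `loader` defaults to `datasets.load_dataset`; pass a stand-in for testing.
    """
    if loader is None:
        # Imported lazily so the module loads even without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True yields rows lazily instead of fetching every shard first.
    rows = loader(dataset_id, split=split, streaming=True)
    return list(islice(rows, n))
```

Calling `peek_rows()` with no arguments needs the `datasets` package and network access. The column names are not stated on this page, so printing the keys of the first returned row is a reasonable first step before writing any processing code.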

Taxonomy

Topics: Natural Language Processing Techniques