Matina: A Large-Scale 73B Token Persian Text Corpus

Sara Bourbour Hosseinbeigi; Fatemeh Taherinezhad; Heshaam Faili; Hamed Baghbani; Fatemeh Nadi; Mostafa Amiri

arXiv:2502.09188·cs.CL·February 14, 2025

TL;DR

Matina is a large-scale, high-quality Persian text corpus of 73 billion tokens designed to support the development of NLP models and open-source LLMs for Persian, addressing previous resource limitations.

Contribution

This paper introduces the Matina corpus, a 73-billion-token Persian dataset that has been preprocessed and deduplicated, expanding the resources available for Persian NLP research.

Findings

1. The dataset improves model training quality for Persian NLP tasks.
2. Transformer models trained on Matina achieve competitive performance.
3. Public availability fosters further research and development in Persian NLP.

Abstract

Text corpora are essential for training models used in tasks like summarization, translation, and large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models…
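The abstract mentions deduplication without describing the procedure in this excerpt. As a rough illustration only, the sketch below shows one common approach, exact deduplication by hashing normalized text. It is not the authors' pipeline: the function names `normalize` and `deduplicate` are hypothetical, and the normalization choices (NFKC, mapping Arabic yeh and kaf to their Persian forms, collapsing whitespace) are assumptions made for this example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalizer: the choices below are assumptions, not Matina's rules."""
    # Fold Unicode compatibility forms into a canonical representation.
    text = unicodedata.normalize("NFKC", text)
    # Map Arabic yeh (U+064A) and kaf (U+0643) to Persian yeh (U+06CC) and kaf (U+06A9).
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(text.split())


def deduplicate(docs):
    """Keep the first occurrence of each document whose normalized text is identical."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Exact hashing only catches documents that are identical after normalization. Large web corpora often also need near-duplicate detection, which this sketch does not attempt.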

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques