naab: A ready-to-use plug-and-play corpus for Farsi

Sadra Sabouri; Elnaz Rahmati; Soroush Gooran; Hossein Sameti

arXiv:2208.13486 · cs.CL · December 24, 2024 · 1 citation

TL;DR

This paper introduces naab, a large, clean, and publicly available Farsi corpus designed to enhance NLP research and model performance in low-resource languages, supporting the development of better language models for Farsi.

Contribution

We present naab, the largest publicly available, ready-to-use Farsi corpus, along with a preprocessing toolkit, to support NLP research and model training for low-resource languages.

Findings

01 naab contains 130GB of data, comprising over 250 million paragraphs and 15 billion words.

02 The corpus is openly accessible via Hugging Face.

03 A preprocessing toolkit supports cleaning of custom corpora.
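
Since the corpus is distributed through Hugging Face, a few paragraphs can be inspected without downloading all 130GB. The following is a minimal sketch using the `datasets` library in streaming mode. The repository identifier `SLPL/naab`, the `train` split, and the `text` column name are assumptions not stated on this page and should be checked against the dataset card; `first_texts` and `preview_naab` are hypothetical helper names.

```python
from itertools import islice
from typing import Iterable, List

# Assumed Hub identifier and column name; neither is confirmed by this page.
NAAB_REPO_ID = "SLPL/naab"
TEXT_FIELD = "text"


def first_texts(records: Iterable[dict], n: int, field: str = TEXT_FIELD) -> List[str]:
    """Return `field` from the first n records without reading any further."""
    return [record[field] for record in islice(records, n)]


def preview_naab(n: int = 3, repo_id: str = NAAB_REPO_ID) -> List[str]:
    """Stream the first n paragraphs instead of downloading the full corpus."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    # streaming=True yields records lazily over the network.
    stream = load_dataset(repo_id, split="train", streaming=True)
    return first_texts(stream, n)
```

Calling `preview_naab(3)` requires network access and the `datasets` package; `first_texts` works on any iterable of dictionaries, which makes it easy to try on local data first.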

Abstract

The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low- and mid-resource languages, such as Farsi, still lags behind resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word NAAB (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource…
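
To make the idea of cleaning a custom corpus concrete, here is a minimal sketch of the kind of steps such a pipeline might include: character normalization, whitespace collapsing, short-paragraph filtering, and exact deduplication. This is not the authors' toolkit; the function names, the `min_words` threshold, and the choice of steps are all illustrative assumptions.

```python
import re
import unicodedata
from typing import Iterable, Iterator

# Arabic code points commonly mapped to their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})
# \s does not match the zero-width non-joiner (U+200C), so it is preserved.
WHITESPACE = re.compile(r"\s+")


def normalize(paragraph: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", paragraph)
    text = text.translate(ARABIC_TO_PERSIAN)
    return WHITESPACE.sub(" ", text).strip()


def clean_corpus(paragraphs: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield normalized paragraphs, skipping short ones and exact duplicates.

    min_words is an illustrative threshold, not a value from the paper.
    """
    seen = set()
    for paragraph in paragraphs:
        text = normalize(paragraph)
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        yield text
```

Because `clean_corpus` is a generator, it can be applied to a streamed corpus, although the `seen` set grows with the number of distinct paragraphs; at the scale described here, hashing or approximate deduplication would likely be needed instead.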

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification