PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Hanif Rahman

arXiv:2603.16354·cs.CL·March 18, 2026

PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Hanif Rahman

PDF

Open Access

TL;DR

PashtoCorp is a large, high-quality Pashto language corpus that significantly improves NLP model performance for Pashto through extensive data, a reproducible pipeline, and comprehensive evaluation benchmarks.

Contribution

This work introduces PashtoCorp, the largest Pashto corpus to date, along with a reproducible pipeline and evaluation suite, advancing low-resource language NLP development.

Findings

01

Pretraining on PashtoCorp reduces XLM-R perplexity by 25.1%.

02

Pretrained models improve Pashto NER F1 by 10%.

03

PashtoCorp enables the first Pashto reading comprehension benchmark.

Abstract

We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques · Topic Modeling · Text Readability and Simplification