KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset

Haq Nawaz Malik; Nahfid Nissar

arXiv:2604.11066·cs.CL·April 14, 2026

TL;DR

KS-PRET-5M is the largest publicly available Kashmiri pretraining dataset: 5.09M words of cleaned text drawn from digitized archival material and Kashmiri-language web sources, released to support language modeling and linguistic research for Kashmiri.

Contribution

This work introduces the first large-scale Kashmiri pretraining dataset, covering its eleven-stage cleaning pipeline, an empirical tokenization analysis, and a public release for NLP research and model training.

Findings

01. The dataset contains 5,090,244 words and about 12 million tokens.

02. The cleaning pipeline achieves a mean Kashmiri script ratio of 0.9965, leaving only 146 Devanagari characters across the full dataset.

03. The dataset provides a publicly available resource for Kashmiri language modeling and NLP.
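
The word, character, vocabulary, and token counts reported for the dataset can be recomputed from raw text. The following is a minimal sketch, not the authors' code: the function name `corpus_stats`, whitespace word splitting, and counting characters as `len(line)` are all assumptions, and the paper's exact counting conventions may differ. The tokenizer is passed in as a callable so the sketch runs without downloading a model.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def corpus_stats(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Count words, characters, word types, and subword tokens.

    Assumptions (hypothetical, not taken from the paper):
    - words are whitespace-separated strings,
    - characters are counted with len(), spaces included,
    - `tokenize` maps a line to its list of subword tokens.
    """
    words = 0
    characters = 0
    subwords = 0
    vocab: Counter = Counter()

    for line in lines:
        line_words = line.split()
        words += len(line_words)
        characters += len(line)
        vocab.update(line_words)
        subwords += len(tokenize(line))

    return {
        "words": words,
        "characters": characters,
        "word_types": len(vocab),
        "subwords": subwords,
        # Average number of subword tokens per whitespace word.
        "subwords_per_word": subwords / words if words else 0.0,
    }


if __name__ == "__main__":
    # Stand-in tokenizer for the demo: one token per non-space character.
    def char_tokenize(s: str) -> List[str]:
        return list(s.replace(" ", ""))

    print(corpus_stats(["ab cd", "ab"], char_tokenize))
```

To approximate the setup described in the abstract, one could pass the `tokenize` method of `transformers.AutoTokenizer.from_pretrained("google/muril-base-cased")` as the callable; whether that matches the authors' exact procedure is an assumption.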

Abstract

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.7M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: (1) digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik [malik2024inpage]; and (2) Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of…
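
The script ratio and the Devanagari character count quoted in the abstract can be illustrated with a short check. This is a sketch under stated assumptions, not the authors' pipeline: it defines the script ratio as the share of letters and combining marks that fall in the Unicode Arabic-script blocks (Kashmiri is written in a Perso-Arabic script), and it counts any code point in the Devanagari block U+0900 to U+097F as contamination. The function name `script_stats` and the exact block list are hypothetical choices.

```python
import unicodedata
from typing import List, Tuple

# Unicode blocks covering Arabic-script characters (assumed block list).
ARABIC_RANGES: List[Tuple[int, int]] = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]
DEVANAGARI_RANGES: List[Tuple[int, int]] = [(0x0900, 0x097F)]


def in_ranges(ch: str, ranges: List[Tuple[int, int]]) -> bool:
    """Return True if the code point of `ch` lies in any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def script_stats(text: str) -> Tuple[float, int]:
    """Return (Arabic-script ratio, Devanagari character count).

    The ratio is computed over letters and combining marks only
    (Unicode general categories L* and M*), so digits, spaces, and
    punctuation do not affect it. This definition is an assumption.
    """
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    arabic = sum(in_ranges(ch, ARABIC_RANGES) for ch in letters)
    devanagari = sum(in_ranges(ch, DEVANAGARI_RANGES) for ch in text)
    ratio = arabic / len(letters) if letters else 0.0
    return ratio, devanagari


if __name__ == "__main__":
    # Kashmiri word followed by Latin letters, written with escapes.
    sample = "\u06a9\u0672\u0634\u064f\u0631 abc"
    print(script_stats(sample))
```

A dataset-level mean such as the reported 0.9965 would then be an average of this ratio over documents or lines; which unit the authors average over is not stated in the text above.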

Code & Models

Datasets

Omarrran/KS-PRET-5M_5_million_kashmiri_Pretrainning_LLM_dataset_12M_tokens_2026
dataset · 32 downloads
