KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
Haq Nawaz Malik, Nahfid Nissar

TL;DR
KS-PRET-5M is the largest publicly available Kashmiri pretraining dataset, enabling advancements in language modeling and linguistic research for Kashmiri through extensive, cleaned, and tokenized text data.
Contribution
This work introduces the first large-scale Kashmiri pretraining dataset with comprehensive cleaning, tokenization, and release for NLP research and model training.
Findings
Dataset contains 5,090,244 words, 27,692,959 characters, and roughly 12 million tokens.
Achieved a mean Kashmiri script ratio of 0.9965, with Devanagari contamination reduced to 146 characters across the full dataset.
Provides a valuable resource for Kashmiri language modeling and NLP.
Abstract
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: (1) digitized archival and literary material (literature, news, biographies, novels, poetry, religious scholarship, and academic writing), recovered from the proprietary InPage desktop-publishing format using the converter of Malik [malik2024inpage], and (2) Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of…
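The script-purity figures in the abstract can be illustrated with a minimal sketch. The listing does not give the paper's exact definition of "Kashmiri script ratio", so the code below assumes it is the share of letter and combining-mark characters that fall in the Perso-Arabic Unicode blocks. The function name `script_stats`, the block ranges, and that definition are all assumptions made for illustration, not the authors' pipeline.

```python
import unicodedata

# Assumed Unicode ranges for Perso-Arabic (Kashmiri) text; the paper's own
# definition of the script ratio may use a different character set.
PERSO_ARABIC_RANGES = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]
DEVANAGARI_RANGES = [(0x0900, 0x097F)]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch lies in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def script_stats(text):
    """Hypothetical helper: script ratio and Devanagari count for one text.

    Only letters (category L*) and combining marks (category M*) are counted,
    so whitespace, digits, and punctuation do not affect the ratio.
    """
    letters = [
        ch for ch in text
        if unicodedata.category(ch).startswith(("L", "M"))
    ]
    perso_arabic = sum(_in_ranges(ch, PERSO_ARABIC_RANGES) for ch in letters)
    devanagari = sum(_in_ranges(ch, DEVANAGARI_RANGES) for ch in letters)
    ratio = perso_arabic / len(letters) if letters else 0.0
    return {
        "letters": len(letters),
        "script_ratio": ratio,
        "devanagari_chars": devanagari,
    }


if __name__ == "__main__":
    # A short Kashmiri word followed by Latin letters lowers the ratio.
    print(script_stats("کٲشُر abc"))
```

Under this assumed definition, a corpus-level mean would be the average of `script_ratio` over documents, and the Devanagari count would be the sum of `devanagari_chars`.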
