KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for Large Language Model Pretraining
Haq Nawaz Malik

TL;DR
This paper introduces KS-LIT-3M, a 3.1 million word Kashmiri text dataset created through specialized conversion and preprocessing, aimed at enabling effective pretraining of language models for Kashmiri NLP tasks.
Contribution
The paper presents a large, high-quality Kashmiri text corpus and a novel InPage-to-Unicode converter, addressing critical data scarcity for Kashmiri language modeling.
Findings
Dataset contains 3.1 million words from diverse genres
Enables pretraining of Kashmiri language models
Addresses resource gap for Kashmiri NLP
Abstract
Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a…
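The abstract describes the corpus as a single continuous text stream intended for causal language modeling, where each token is predicted from the tokens before it. As a rough illustration only, the sketch below shows one common way such a stream can be cut into fixed-length input/target pairs. The function name `make_causal_examples`, the `block_size` parameter, the toy Latin-script string, and the character-level tokenization are all assumptions made for this example; the paper's actual tokenizer and training setup are not given in the text above.

```python
from typing import List, Tuple


def make_causal_examples(
    token_ids: List[int], block_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Split one continuous token stream into (inputs, targets) pairs.

    Targets are the inputs shifted left by one position, so position i
    of the targets is the token that follows position i of the inputs.
    Trailing tokens that do not fill a whole block are dropped.
    """
    examples = []
    # Each example needs block_size + 1 tokens: block_size inputs plus
    # one extra token so the last input position has a target.
    for start in range(0, len(token_ids) - block_size, block_size):
        window = token_ids[start:start + block_size + 1]
        examples.append((window[:-1], window[1:]))
    return examples


# Hypothetical stand-in for the corpus: a short placeholder string,
# tokenized per character purely for illustration.
text = "abcdefghij"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]

examples = make_causal_examples(ids, block_size=4)
for inputs, targets in examples:
    print(inputs, "->", targets)
```

Because the stream is continuous, no document boundaries need to be tracked in this sketch; a real pipeline would typically substitute a subword tokenizer and a much larger block size.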
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
