KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for Large Language Model Pretraining

Haq Nawaz Malik

arXiv:2601.01091·cs.CL·January 6, 2026

TL;DR

This paper introduces KS-LIT-3M, a 3.1 million word Kashmiri text dataset built by converting InPage-encoded literature to Unicode and preprocessing it, intended for pretraining language models on Kashmiri.

Contribution

The paper presents a large, high-quality Kashmiri text corpus and a novel InPage-to-Unicode converter, addressing critical data scarcity for Kashmiri language modeling.

Findings

01. Dataset contains 3.1 million words from diverse genres
02. Enables pretraining of Kashmiri language models
03. Addresses resource gap for Kashmiri NLP

Abstract

Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a…
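
The abstract describes the corpus as one continuous text stream for causal language model training, where each token is predicted from the tokens before it. The following is a minimal sketch of how such a stream is commonly cut into training examples. It is not the paper's pipeline: the function name `causal_windows`, the `context_len` parameter, and the toy whitespace tokenizer are all assumptions made for illustration, and a real setup would use a subword tokenizer.

```python
from typing import Iterator, List, Tuple


def causal_windows(
    token_ids: List[int], context_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) pairs from one continuous token stream.

    Hypothetical helper: targets are the inputs shifted left by one token,
    which is the standard next-token prediction setup.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    # Each window needs context_len + 1 tokens; the extra token is the
    # target for the last input position. Incomplete tails are dropped.
    for start in range(0, len(token_ids) - context_len, context_len):
        chunk = token_ids[start : start + context_len + 1]
        yield chunk[:-1], chunk[1:]


# Toy stream with a whitespace "tokenizer" (an assumption for illustration).
stream = "a b c d e f g".split()
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(stream))}
ids = [vocab[tok] for tok in stream]

pairs = list(causal_windows(ids, context_len=3))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Consecutive windows here share one boundary token, so every position in the stream except the first is used as a target at most once. Whether windows should overlap more, or respect document boundaries, is a design choice the abstract does not specify.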

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection