600K-KS-OCR: A Large-Scale Synthetic Dataset for Optical Character Recognition in Kashmiri Script
Haq Nawaz Malik

TL;DR
This paper introduces a large-scale synthetic dataset of over 600,000 word-level Kashmiri script images, designed to advance OCR research for Kashmiri, an endangered language written in a modified Perso-Arabic script.
Contribution
It provides the first extensive synthetic OCR dataset for Kashmiri, with three traditional typefaces, augmentations that simulate document degradation, and ground-truth transcriptions, filling a critical resource gap for low-resource language recognition.
Findings
The dataset enables training of OCR models for Kashmiri script.
Diverse fonts, augmentations, and background textures are included to improve model robustness.
Open access facilitates further research in low-resource OCR.
Abstract
This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language spoken by approximately seven million people and written in a modified Perso-Arabic script. Each image is rendered at 256x64 pixels, with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned…
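As a rough illustration of how word-level transcriptions might be prepared for a CRNN trained with CTC loss, the sketch below parses a label listing and builds a character vocabulary. The tab-separated `filename<TAB>text` layout, the file names, and the helper names `read_labels`, `build_vocab`, and `encode` are assumptions made for this example; the abstract does not specify the dataset's actual label formats. Only the 256x64 image size comes from the abstract.

```python
import io

# Image size stated in the abstract (width x height in pixels).
IMAGE_WIDTH, IMAGE_HEIGHT = 256, 64


def read_labels(handle):
    """Parse 'filename<TAB>text' lines into (filename, text) pairs.

    The tab-separated layout is a hypothetical format, not the dataset's own.
    """
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, text = line.split("\t", 1)
        pairs.append((name, text))
    return pairs


def build_vocab(pairs):
    """Map each distinct code point to an index, reserving 0 for the CTC blank."""
    chars = sorted({ch for _, text in pairs for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a transcription into a list of integer class indices."""
    return [vocab[ch] for ch in text]


# Two made-up entries standing in for a real label file.
sample = io.StringIO("img_000001.png\tکٲشُر\nimg_000002.png\tزَبان\n")
pairs = read_labels(sample)
vocab = build_vocab(pairs)
encoded = [encode(text, vocab) for _, text in pairs]
```

Encoding per Unicode code point keeps base letters and combining diacritics as separate classes, which is one common choice for Perso-Arabic scripts; a pipeline could equally group them into grapheme clusters.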
Taxonomy
Topics: Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
