# Pahari POS-tagged corpus: A large-scale linguistic resource for NLP applications

**Authors:** Nadia Mushtaq Gardazi, Muhammad Kamran Malik, Ali Daud

PMC · DOI: 10.1016/j.dib.2026.112515 · Data in Brief · 2026-02-03

## TL;DR

This paper introduces a 200,000-token POS-tagged dataset for Pahari, an under-resourced Indo-Aryan language, to support NLP research and applications.

## Contribution

The paper presents the first POS-tagged corpus for Pahari, built with a tag set tailored to the language and annotated with 92.3% inter-annotator agreement.

## Key findings

- A 200,000-token Pahari corpus was manually annotated with a POS tag set derived from existing tag sets for Urdu, Hindi, Punjabi, and other Indo-Aryan languages.
- The annotation achieved an inter-annotator agreement of 92.3%, measured with Cohen’s Kappa.
- The dataset provides a foundation for future NLP tasks like NER and morphosyntactic analysis in Pahari.

## Abstract

This paper presents the development of a Part-of-Speech (POS) tagged dataset for Pahari, an under-resourced Indo-Aryan language spoken in Azad Jammu and Kashmir, Pakistan, as well as parts of India and Nepal. The lack of linguistic resources for Pahari has hindered the advancement of Natural Language Processing (NLP) tools and limited computational analysis of the language. This study addresses this gap by creating a POS-tagged dataset, defining a tag set tailored to Pahari, and establishing annotation guidelines. The Pahari POS tag set was designed by leveraging existing tag sets from Urdu, Hindi, Punjabi, and other Indo-Aryan languages, ensuring linguistic compatibility. A corpus of 200,000 tokens was collected and manually annotated, achieving an inter-annotator agreement of 92.3% (Cohen’s Kappa). This paper explores the key challenges faced during data collection, preprocessing, and annotation, and details the methodologies employed to address them. The resulting dataset represents the first structured linguistic resource developed for NLP in the Pahari language. It lays a critical foundation for future research in areas such as morphosyntactic analysis, Named Entity Recognition (NER), and the development of machine learning-based NLP applications.
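To make the agreement measure concrete, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators who tagged the same tokens. It is not the authors' evaluation script: the function name `cohen_kappa`, the toy tag sequences, and the tag labels (`NN`, `VB`, `JJ`) are illustrative assumptions and are not taken from the Pahari tag set or corpus.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must label the same non-empty token list")

    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: for each tag, the product of the two annotators'
    # marginal probabilities of using it, summed over all tags.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag everywhere;
        # kappa is undefined, so treat it as perfect agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical toy data: six tokens tagged independently by two annotators.
annotator_a = ["NN", "VB", "NN", "JJ", "NN", "VB"]
annotator_b = ["NN", "VB", "JJ", "JJ", "NN", "VB"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")
```

In this toy case the annotators agree on five of six tokens (about 83% raw agreement), while Kappa comes out lower because it subtracts the agreement expected by chance from the tag distributions. This is why a Kappa-based figure is generally a stricter measure than simple percentage agreement.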

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12992506/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12992506/full.md

## References

9 references — full list in the complete paper: https://tomesphere.com/paper/PMC12992506/full.md

---
Source: https://tomesphere.com/paper/PMC12992506