Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective

Nudrat Habib

arXiv:2406.09549 · cs.CL · October 3, 2024

TL;DR

This paper builds a dependency parser for Urdu, a low-resource language with complex morphology, by manually annotating a treebank and training feature-based MaltParser models on it, reaching 70% labeled accuracy (LA) and an 84% unlabeled attachment score (UAS).

Contribution

It introduces the first Urdu dependency treebank and demonstrates effective parsing models tailored for Urdu's syntactic and morphological features.

Findings

01 Achieved a best labeled accuracy (LA) of 70%

02 Achieved an unlabeled attachment score (UAS) of 84%

03 Validated the feasibility of dependency parsing for Urdu

Abstract

Parsing is the process of analyzing a sentence's syntactic structure by breaking it down into its grammatical components, and it is critical for various linguistic applications. Urdu is a low-resource, free word-order language with complex morphology. The literature suggests that dependency parsing is well suited to such languages. Our approach begins with a basic feature model encompassing word location, head word identification, and dependency relations, followed by a more advanced model integrating part-of-speech (POS) tags and morphological attributes (e.g., suffixes, gender). We manually annotated a corpus of news articles of varying complexity. Using MaltParser with the NivreEager algorithm, we achieved a best labeled accuracy (LA) of 70% and an unlabeled attachment score (UAS) of 84%, demonstrating the feasibility of dependency parsing for Urdu.
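The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: UAS is the fraction of tokens whose predicted head matches the gold head, and LA is the fraction of tokens whose predicted dependency label matches the gold label. The function name `attachment_scores`, the toy sentence, and the labels are hypothetical; they are not taken from the paper's treebank or label set.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 denotes the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LA) for token-aligned gold and predicted analyses."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    n = len(gold)
    # UAS: the head is correct, regardless of the label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    # LA: the label is correct, regardless of the head.
    la = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    return uas, la


# Hypothetical four-token sentence (labels are illustrative only).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "case")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "nmod")]

uas, la = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LA = {la:.2f}")
```

In this toy case three of four heads and two of four labels are correct, so the sketch reports a UAS of 0.75 and an LA of 0.50. Scoring heads and labels separately is what allows the two figures to differ, as they do in the reported 84% UAS and 70% LA.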

Taxonomy

Topics: Natural Language Processing Techniques