The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
Angelina A. Aquino, Lester James V. Miranda, Elsie Marie T. Or

TL;DR
This paper introduces UD-NewsCrawl, the largest Tagalog treebank with 15.6k annotated trees, and evaluates transformer models on Tagalog dependency parsing, highlighting unique linguistic challenges and implications for future research.
Contribution
The paper describes the creation of UD-NewsCrawl, the largest Tagalog treebank to date, reports baseline evaluations with transformer-based dependency parsers, and discusses the challenges that Tagalog's grammar poses for syntactic annotation.
Findings
Baseline transformer models achieve moderate parsing accuracy.
Tagalog's unique grammar poses specific challenges for dependency parsing.
The treebank provides a valuable resource for underrepresented language NLP research.
Abstract
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss their implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
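Dependency parsers trained on Universal Dependencies treebanks are commonly scored with unlabeled and labeled attachment scores (UAS and LAS). This summary does not state which metrics or tooling the authors used, so the following is only a minimal sketch of how such scores could be computed from CoNLL-U files. The function names `parse_conllu` and `attachment_scores` are hypothetical, the three-token sentence uses placeholder tokens rather than data from the treebank, and labels are compared in full (some official scorers strip relation subtypes first).

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation)


def parse_conllu(text: str) -> List[List[Arc]]:
    """Collect (HEAD, DEPREL) pairs for each sentence in CoNLL-U text."""
    sentences: List[List[Arc]] = []
    current: List[Arc] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold: List[List[Arc]], pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentence counts differ")
    total = uas_hits = las_hits = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted token counts differ")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1          # head is correct
                if g_rel == p_rel:
                    las_hits += 1      # head and label are both correct
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# Toy data (assumption): placeholder tokens, not taken from UD-NewsCrawl.
GOLD = "\n".join([
    "# text = w1 w2 w3",
    "1\tw1\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tw2\t_\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tw3\t_\tNOUN\t_\t_\t1\tnsubj\t_\t_",
    "",
])
PRED = "\n".join([
    "# text = w1 w2 w3",
    "1\tw1\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tw2\t_\tDET\t_\t_\t1\tdet\t_\t_",    # wrong head
    "3\tw3\t_\tNOUN\t_\t_\t1\tobj\t_\t_",   # right head, wrong label
    "",
])

uas, las = attachment_scores(parse_conllu(GOLD), parse_conllu(PRED))
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In this toy case two of three heads match and only the root token matches on both head and label, so LAS is lower than UAS. LAS can never exceed UAS, because a labeled hit requires a correct head.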
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Translation Studies and Practices
