A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing

Amir Zeldes; Nick Howell; Noam Ordan; Yifat Ben Moshe

arXiv:2210.07873·cs.CL·October 19, 2022

TL;DR

This paper introduces a new, freely available Hebrew UD treebank stratified across a range of topics from Hebrew Wikipedia, evaluates the quality of its annotations, and reports new state-of-the-art results on UD NLP tasks alongside the first cross-domain parsing experiments in Hebrew.

Contribution

It presents a new Hebrew UD treebank from Wikipedia, updates the annotation scheme, and conducts the first cross-domain parsing experiments in Hebrew.

Findings

01 Achieved state-of-the-art results on Hebrew UD NLP tasks
02 Validated annotation quality with automatic tools
03 Demonstrated improved cross-domain parsing performance

Abstract

Foundational Hebrew NLP tasks such as segmentation, tagging and parsing have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old, and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew stratified from a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021), and conduct the first cross-domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer-based approaches. We also release a new version of the…

Code & Models

Datasets

iahlt/UD_Hebrew-IAHLTwiki
dataset· 46 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification