What's Wrong with Hebrew NLP? And How to Make it Right
Reut Tsarfaty, Amit Seker, Shoval Sadde, Stav Klein

TL;DR
This paper introduces Onlp, a joint morpho-syntactic parser for Modern Hebrew that improves accuracy by reducing error propagation, addressing challenges faced by NLP tools in morphologically-rich languages.
Contribution
The paper presents a novel joint inference framework for Hebrew NLP that enhances accuracy and provides rich output, filling a gap in tools for morphologically-rich languages.
Findings
Onlp achieves high accuracy in Hebrew morphological and syntactic parsing.
Joint inference reduces error propagation compared to pipeline approaches.
The tool supports diverse academic and commercial applications.
| Tok | MA | MD | POS | Lem | Feats | Deps | Joint | |
| Tasks | ||||||||
| MILA | ||||||||
| NITE | ||||||||
| Hebrew-NLP | ||||||||
| Adler | ✓ | ✓ | ||||||
| Goldberg | ✓ | |||||||
| Pipelines | ||||||||
| UDPipe | ||||||||
| CoreNLP | ||||||||
| ONLP |
| From | To | Form | Lemma | Part of Speech | Features | Token Number |
|---|---|---|---|---|---|---|
| 0 | 1 | \cjRLh | \cjRLh | DEF | _ | 1 |
| 0 | 3 | \cjRLh | \cjRLh | REL | _ | 1 |
| 0 | 5 | \cjRLhbn | \cjRLhbyn | VB | gen=M,num=S,per=2,tense=IMPERATIVE | 1 |
| 1 | 2 | \cjRLb | \cjRLb | IN | _ | 1 |
| 1 | 5 | \cjRLbn | \cjRLbn | NNP | gen=M,num=S | 1 |
| 1 | 5 | \cjRLbn | \cjRLbn | NNT | gen=M,num=S | 1 |
| 1 | 5 | \cjRLbn | \cjRLbn | NN | gen=M,num=S | 1 |
| 2 | 5 | \cjRLhn | \cjRLhn | S_PRN | gen=F,num=P,per=3 | 1 |
| 3 | 4 | \cjRLb | \cjRLb | IN | _ | 1 |
| 3 | 5 | \cjRLbn | \cjRLbn | NNP | gen=M,num=S | 1 |
| 3 | 5 | \cjRLbn | \cjRLbn | NNT | gen=M,num=S | 1 |
| 3 | 5 | \cjRLbn | \cjRLbn | NN | gen=M,num=S | 1 |
| 4 | 5 | \cjRLhn | \cjRLhn | S_PRN | gen=F,num=P,per=3 | 1 |
| 5 | 6 | \cjRL/s | \cjRL/s | REL | _ | 2 |
| 5 | 7 | \cjRL/snm | \cjRL/sn | NN | gen=F,num=S,suf_gen=M,suf_num=P,suf_per=3 | 2 |
| 6 | 7 | \cjRLnm | \cjRLnm | VB | gen=M,num=S,per=A,tense=BEINONI | 2 |
| 6 | 7 | \cjRLnm | \cjRLnm | BNT | gen=M,num=S,per=A | 2 |
| 6 | 7 | \cjRLnm | \cjRLnm | BN | gen=M,num=S,per=A | 2 |
| 6 | 7 | \cjRLnm | \cjRLnm | VB | gen=M,num=S,per=3,tense=PAST | 2 |
| 7 | 8 | \cjRLb | \cjRLb | PREPOSITION | _ | 3 |
| 7 | 10 | \cjRLb.sl | \cjRLb.sl | NN | gen=M,num=S | 3 |
| 7 | 10 | \cjRLb.sl | \cjRLb.sl | NNT | gen=M,num=S | 3 |
| 8 | 9 | \cjRLh | \cjRLh | DEF | _ | 3 |
| 8 | 10 | \cjRL.sl | \cjRL.sl | NN | gen=M,num=S | 3 |
| 8 | 10 | \cjRL.sl | \cjRL.sl | NNT | gen=M,num=S | 3 |
| 9 | 10 | \cjRL.sl | \cjRL.sl | NNT | gen=M,num=S | 3 |
| 9 | 10 | \cjRL.sl | \cjRL.sl | NN | gen=M,num=S | 3 |
| POS | Definition | Example |
|---|---|---|
| ADVERB | The word \cjRLk*: before numerals | \cjRLk*:mylywn |
| AT | The accusative marker \cjRL’t, which is a separate word in Hebrew | \cjRL’t hklb |
| BN | Participle (Beinoni) | \cjRLmgy‘ym |
| BNT | Participle in construct state form | \cjRLmqymy h‘ytwn |
| CC | Conjunction | \cjRL’l’ |
| REL | Relative clause marker | \cjRL/s: |
| CD | Numeral | \cjRLm’wt |
| CDT | Numeral in construct state | \cjRL’lpy |
| CONJ | Coordinating conjunction \cjRLw | \cjRLw: |
| COP | Copula | \cjRLhyh |
| DEF | A special tag assigned to the definite marker \cjRLh which appears with nouns, adjectives and numerals | \cjRLh |
| DTT | Determiner | \cjRLkl |
| DUMMY_AT | Accusative marker \cjRL’t when used with a pronominal suffix | \cjRL’wtw |
| EX | The existential markers \cjRLy/s or \cjRL’yn | \cjRLy/s |
| IN | Preposition | \cjRL‘d |
| INTJ | Interjection | \cjRLn’ |
| JJ | Adjective | \cjRLzrym |
| JJT | Adjective in construct state | \cjRLypy np/s |
| MD | Modal predicates | \cjRL.sryK |
| NN | Noun | \cjRL.hbr |
| NN_S_PP | Noun with a pronominal suffix | \cjRLpw‘lyhM |
| NNP | Proper Noun | \cjRLn‘my |
| NNT | Construct state noun | \cjRLh‘sqt |
| P | Prefix written as a separate word | \cjRLblty |
| POS | Possessive preposition \cjRL/sl | \cjRL/sl |
| PREPOSITION | Inseparable preposition | \cjRLb*: |
| PRP | Personal Pronoun | \cjRLhy’ |
| S_PRP | Reflexive pronoun | \cjRL‘.smy |
| QW | Question word | \cjRLky.sd |
| S_PRN | Personal pronoun attached to a preposition as a pronominal suffix | \cjRL’wtnw |
| TEMP | Subordinating conjunction introducing time clauses | \cjRLk*:/s: |
| VB | Verb | \cjRL’mrh |
| yyCLN | Colon | : |
| yyCM | Comma | , |
| yyDASH | Hyphen or dash | - |
| yyDOT | Period | . |
| yyELPS | Ellipsis | … |
| yyEXCL | Exclamation mark | ! |
| yyLRB | Left Parenthesis | ( |
| yyQM | Question Mark | ? |
| yyQUOT | Quotation Mark | ” ” |
| yyRRB | Right Parenthesis | ) |
| yySCLN | Semicolon | ; |
| Dependency | Definition | Relation | Example sentence |
|---|---|---|---|
| num | numerical modifier | num (\cjRL’n/syM, \cjRL‘/srwt) | \cjRL‘/srwt ’n/syM mgy‘ym mt’ylnd |
| subj | subject | subj (\cjRLhtbrrh, \cjRLhtwp‘h) | \cjRLhtwp‘h htbrrh ’tmwl |
| ROOT | root | ROOT ( root ,\cjRL.t‘nh) | \cjRLhy’ .t‘nh kK |
| prepmod | prepositional modifier | prepmod (\cjRLm: , \cjRL.sd) | \cjRLm.sd ’.hd |
| pobj | object of a preposition | pobj (\cjRLl: , \cjRLn.sygyM) | \cjRLhy’ tpnh ln.sygym |
| comp | complement | comp (\cjRLbkh, \cjRLk’/sr) | \cjRLhyld bkh k’/sr lq.hw lw ’t h.s‘.sw‘ |
| conj | conjunct | conj (\cjRLw: , \cjRLr‘myM ) | \cjRLr‘myM wbrqyM |
| punct | punctuation | punct (\cjRLn/sm‘h , : ) | !\cjRLttkwpP :\cjRLn/sm‘h qry’h |
| advcl | adverbial clause | advcl (\cjRL’M, \cjRLyw/sg) | \cjRLhM y/sbtw ’M l’ yw/sg hskM |
| advmod | adverbial modifier | advmod (\cjRLytqblw, \cjRLl’ltr) | \cjRLkwlM ytqblw l’ltr |
| obj | object | obj (\cjRLt‘/sh, \cjRLml.hmh) | \cjRLbc.hkmh t‘/sh lk ml.hmh |
| amod | adjectival modifier | amod (\cjRLhby.tw.h,\cjRLhl’wmy) | \cjRLhby.tw.h hl’wmy b/sbyth |
| det | determiner | det (\cjRLhyldyM,\cjRLkl) | \cjRL’ny rw’h ’t kl hyldyM |
| def | definite marker | def (\cjRL’mbwlns,\cjRLh) | \cjRLh’mbwlns ht.hyl lnsw‘ |
| gobj | genitive object | gobj (\cjRLpr/sy,\cjRLm/s.trh) | \cjRLpr/sy m/s.trh y.s’w |
| possmod | possession modifier | possmod (\cjRLw‘dt,\cjRL/sl) | \cjRLw‘dt hkspyM /sl hknst |
| rcmod | relative clause modifier | rcmod (\cjRLhw‘dh,\cjRL/s:) | \cjRLhw‘dh /sdnh bnw/s’ |
| relcomp | relative complement | relcomp (\cjRL/s:,\cjRLdnh) | \cjRLhw‘dh /sdnh bnw/s’ |
| appos | apposition / parenthetical | appos (\cjRLk*”\cjRL.h,\cjRLmpM) | (\cjRLmpM ) \cjRLy’yr .sbN \cjRLk*”\cjRL.h |
| nn | noun modifier | nn (\cjRLsN,\cjRLsymwN) | \cjRLmnzr sN symwN |
| ccomp | complement clause with internal subject | ccomp (\cjRL/s:,\cjRL.hmwd) | \cjRL’mrty lk /s’th .hmwd |
| neg | negative modifier | neg (\cjRLtk‘s,\cjRLl’) | \cjRLhy’ l’ tk‘s |
| pcomp | complement clause of a preposition | pcomp (\cjRLkdy,\cjRLlh/sttP) | \cjRLhw’ .ts kdy lh/sttP bt.hrwt |
| xcomp | complement clause with external subject | xcomp (\cjRLr.sh,\cjRLlh‘lwt) | \cjRLhw’ r.sh lh‘lwt ’t h/skr |
| acc | accusative case | acc (\cjRLly.tpty,\cjRL’t) | \cjRLly.tpty ’t hklb |
| vmod | verb as modifier | vmod (\cjRLsykwy,\cjRLlhtqbl) | \cjRLy/s lw sykwy lhtqbl l’qdmyh |
| gen | genitive case | gen (\cjRLmktbh,\cjRL/sl) | \cjRLmktbh /sl mly pylypsbwrN |
| number | numerical modifier in digits | number(\cjRL.htymwt,84) | \cjRL.htymwt 84 \cjRLhw’ ’sP |
| mwe | multi-word expression | mwe (\cjRLmdy,\cjRL/snh) | \cjRLhw’ .ts mdy /snh |
| goeswith | tokens originally connected with a hyphen | goeswith (\cjRLmwnswn,\cjRLnwwh) | \cjRLmwnswn-\cjRL’n.hnw gryM bnwwh |
| cop | copular element | cop (\cjRLmqwM,\cjRLhy’) | \cjRL’ywbh hy’ mqwM l’ /sgrty |
| cc | introducing conjunction | cc (\cjRL’mr,\cjRLhry) | \cjRLhry hw’ ’mr z’t qwdM |
| npred | noun as predicate | npred (\cjRLhyh,\cjRLqwmwnys.t) | \cjRLhw’ hyh qwmwnys.t |
| parataxis | side-by-side, interjection | parataxis (\cjRL’/sM,\cjRLnwld) | \cjRLhw’ nwld kkh ,\cjRLhw’ l’ ’/sM |
| npadvmod | noun phrase as adverbial modifier | npadvmod (\cjRLyhyh,\cjRLywM) | \cjRLywM ’.hd hw’ yhyh hn/sy’ |
| apred | adjective as predicate | apred (\cjRLhyyty,\cjRLtmyM) | \cjRLmstbr /shyyty tmyM |
| vocative | explicitly addressing a dialogue participant | vocative(\cjRLh‘t,\cjRLrbwty) | \cjRLrbwty ,\cjRLzw h‘t ly/swN |
| aux | auxiliary verb or feature-bundle | aux (\cjRL/sqw‘h,\cjRLhyth) | \cjRLklklth hyyth /sqw‘h bmytwn |
| ppred | preposition as predicate | ppred (\cjRLyhyh,\cjRLb*:) | \cjRLm.hr hq.tyP yhyh b‘y.swmw |
| acomp | adjectival complement | acomp (\cjRLnr’h,\cjRLmyw.hd) | \cjRLhw’ nr’h myw.hd |
| qmark | question | qmark (\cjRLykwlyM,\cjRLh’M) | \cjRLh’M ’tM ykwlyM lhm/syk |
Morphological Analysis Lattice (.ma and .md files)

| Column | Definition | Tag | Comment |
|---|---|---|---|
| col 1 | Morpheme start index in the lattice | FROM | |
| col 2 | Morpheme end index in the lattice | TO | |
| col 3 | Form of the morpheme | FORM | |
| col 4 | Lemma of the morpheme | LEMMA | |
| col 5 | Coarse Part of Speech tag | CPOSTAG | underscore if unavailable |
| col 6 | Fine Part of Speech tag | POSTAG | CPOSTAG and POSTAG are identical in YAP |
| col 7 | Morphological features | FEATS | underscore if unavailable |
| col 8 | Source token index | TOKEN | |

CoNLL file format (.conll)

| Column | Definition | Tag | Comment |
|---|---|---|---|
| col 1 | Morpheme index in the sentence | ID | |
| col 2 | Form of the morpheme | FORM | |
| col 3 | Lemma of the morpheme | LEMMA | underscore if unavailable |
| col 4 | Coarse Part of Speech tag | CPOSTAG | underscore if unavailable |
| col 5 | Fine Part of Speech tag | POSTAG | CPOSTAG and POSTAG are identical in YAP |
| col 6 | Morphological features | FEATS | underscore if unavailable |
| col 7 | Head index pointer | HEAD | note that the resulting structure is a tree |
| col 8 | Dependency relation to the HEAD | DEPREL | |
| col 9 | Projective head | PHEAD | ignore - unused by YAP |
| col 10 | Dependency relation to the PHEAD | PDEPREL | ignore - unused by YAP |
What’s Wrong with Hebrew NLP?
And How to Make it Right
Reut Tsarfaty Amit Seker Shoval Sadde Stav Klein
Open University of Israel, University Road 1, Ra’anana, Israel
{reutts,shovalsa,amitse,stavkl}@openu.ac.il
Abstract
For languages with simple morphology, such as English, automatic annotation pipelines such as spaCy or Stanford’s CoreNLP successfully serve projects in academia and the industry. For many morphologically-rich languages (MRLs), similar pipelines show sub-optimal performance that limits their applicability for text analysis in research and the industry. The sub-optimal performance is mainly due to errors in early morphological disambiguation decisions, which cannot be recovered later in the pipeline, yielding incoherent annotations on the whole. In this paper we describe the design and use of the Onlp suite, a joint morpho-syntactic parsing framework for processing Modern Hebrew texts. The joint inference over morphology and syntax substantially limits error propagation, and leads to high accuracy. Onlp provides rich and expressive output which already serves diverse academic and commercial needs. Its accompanying online demo further serves educational activities, introducing Hebrew NLP intricacies to researchers and non-researchers alike.
1 Introduction
NLP pipelines for the automatic annotation of unstructured texts are at the core of language technology applications for Data Science, Text Analytics and Artificial Intelligence. For English, annotation pipelines such as spaCy (Honnibal and Montani, 2017) or Stanford’s CoreNLP (Manning et al., 2014) successfully deliver the ability to automatically annotate unstructured texts with their underlying linguistic structures, including Part-of-Speech (POS) tags, morphological features, dependency relations, named entities, and so on. These annotations serve research labs, non-profit organizations and commercial endeavors in their quest to make sense of the vast amount of unstructured data available to them.
Universal processing pipelines such as UDPipe (Straka et al., 2016) aim to serve a range of other languages, but unfortunately, their performance on many morphologically rich languages (MRLs) (Tsarfaty et al., 2010), and in particular Semitic languages, is not on a par with their performance on English. This, in turn, greatly limits their applicability for further research and commercial use. The main reason for this sub-optimal performance on Semitic languages is that the pipeline design inherent in these frameworks is inappropriate for languages that exhibit extreme morphological ambiguity in their input stream: errors made early on in morphological segmentation and disambiguation jeopardize the system's accuracy down the pipeline. For Hebrew, this performance gap has long been a show-stopper for advancing Language Technology and Artificial Intelligence for the Hebrew-speaking community. With this contribution, we aim to remedy this situation.
In this paper we describe the design and use of the Onlp system, a joint morphological-syntactic parsing framework for processing the Semitic language Modern Hebrew (henceforth, Hebrew). The system is accurate, efficient, and provides rich and expressive output including segmentation, POS tags, lemmas, features and labeled dependencies. The joint training and inference over the different layers substantially limits error propagation, and leads in turn to speed and high accuracy. Among the technical advantages of the Onlp suite are its open license, an easy 3-step installation, and a single package with all elements included, with no need to train or maintain individual components separately. The Onlp suite already serves academic and commercial projects in diverse domains. Its accompanying online demo has further proved valuable for educational purposes, exposing CS/NLP and non-CS researchers and engineers to the intricacies of Semitic NLP.
2 The Linguistic Challenge
In morphologically-rich languages (MRLs), each input token may consist of multiple lexical and functional units (henceforth, morphemes), each of which serves a particular role in the overall syntactic or semantic representation. In Hebrew, for example, the token ‘\cjRLwk/smhm‘bdh‘ corresponds to five word tokens in English, each of which carries its distinct role: ‘\cjRLw‘ (and, CC), ‘\cjRLk/s‘ (when, REL), ‘\cjRLm:‘ (from, IN), ‘\cjRLh‘ (the, DT), ‘\cjRLm‘bdh‘ (lab, NN). (We use the annotation conventions of Sima’an et al. (2001) that underlie the Hebrew SPMRL scheme; see http://www.spmrl.org/spmrl2013-sharedtask.html.) This means that in order to process Hebrew texts, one first needs to segment the Hebrew tokens into their constituent morphemes. At the same time, Hebrew raw tokens are highly ambiguous. A token such as ‘\cjRLhqph‘ may be interpreted as ‘\cjRLhqph‘ (orbit, NN), ‘\cjRLh‘ + ‘\cjRLqph‘ (the+coffee, DT+NN), or ‘\cjRLhqp’+ ‘\cjRL/sl’ + ‘\cjRLhy’‘ (perimeter of her, NN+POSS+PRP), etc. This is further complicated by the lack of diacritics in standardized texts, meaning that most vowels are not present, and that out of context no reading is a priori more likely than the others. Only in context do the correct interpretation and segmentation become apparent.
These facts create an apparent loop in the design of NLP pipelines for Hebrew: syntactic parsing requires morphological disambiguation, but morphological disambiguation requires syntactic context. This apparent loop has called for the development of joint systems, rather than pipelines, for processing Semitic languages (Tsarfaty, 2006; Green and Manning, 2010). This joint hypothesis has proven useful for Hebrew and Arabic phrase-structure parsing (Goldberg and Tsarfaty, 2008; Green and Manning, 2010; Goldberg and Elhadad, 2011). The Onlp suite is a dependency-based parsing framework implementing this joint hypothesis over the entire morpho-syntactic search space, as depicted in Figure 1 (More et al., 2019).
3 The Architectural Design
The core of Onlp is YAP (Yet Another Parser), a morpho-syntactic parser for morphological and syntactic analysis of Hebrew texts. YAP re-implements and extends the structure-prediction framework of Zhang and Clark (2011). We describe YAP in detail in More and Tsarfaty (2016) and More et al. (2019). Here we only provide a bird’s eye view of the architecture.
In YAP we embrace the extreme morphological ambiguity in Hebrew. That is, we do not aim to resolve morphological ambiguity via pre-processing. The input to YAP is the complete Morphological Analysis (MA) of an input sentence x, termed here MA(x). MA(x) is a lattice structure consisting of all possible morphological analyses of the input sentence, as seen in the middle of Figure 1. Each arc is a tuple specifying the start index, the end index, the form of the segment, its part-of-speech, lemma and features, and the index of the raw token the arc originated from. An arc in the lattice can serve as a node in a syntactic dependency tree. Each contiguous path in the lattice represents one valid morphological segmentation of the sentence, for which a dependency tree can be assigned, as in Figure 1. For each path in the lattice, there is an exponential number of dependency trees that are potentially applicable.
We refer to the task of selecting the most likely lattice path as Morphological Disambiguation (MD), and to the task of selecting the most likely dependency tree for a given path as Dependency Parsing (DEP). For an input sentence x, our goal is to jointly predict a single pair of MD and DEP that are consistent with one another and form the most likely analysis of the sentence.
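To make the lattice structure concrete, the following minimal Python sketch stores arcs as tuples with the seven fields listed above and enumerates every contiguous path between two lattice nodes. The `Arc` type, the `lattice_paths` helper and the ASCII placeholder forms are illustrative assumptions, not part of YAP; the sketch also assumes that every arc points forward (end index greater than start index).

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """One lattice arc; field names are illustrative, not YAP's own."""
    start: int
    end: int
    form: str
    lemma: str
    pos: str
    feats: str
    token: int


def lattice_paths(arcs, start, end):
    """Yield every contiguous sequence of arcs leading from `start` to `end`."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.start].append(arc)

    def extend(node, prefix):
        if node == end:
            yield list(prefix)
            return
        for arc in outgoing[node]:
            prefix.append(arc)
            yield from extend(arc.end, prefix)
            prefix.pop()

    yield from extend(start, [])


# Toy lattice for a single ambiguous token (placeholder ASCII forms).
arcs = [
    Arc(0, 1, "h", "h", "DEF", "_", 1),
    Arc(0, 2, "hbn", "hbyn", "VB", "gen=M,num=S,per=2,tense=IMPERATIVE", 1),
    Arc(1, 2, "bn", "bn", "NN", "gen=M,num=S", 1),
    Arc(1, 2, "bn", "bn", "NNP", "gen=M,num=S", 1),
]

paths = list(lattice_paths(arcs, 0, 2))
for path in paths:
    print(" + ".join(f"{arc.form}/{arc.pos}" for arc in path))
```

Exhaustive enumeration like this is only feasible for tiny lattices, since the number of paths grows quickly with sentence length.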
The MD component is the transition-based morphological parser of More and Tsarfaty (2016), which is formally based on the structure-prediction framework of Zhang and Clark (2011). MD accepts a sentence lattice MA(x) as input and delivers a selected sequence of arcs (morphemes) MD(x) as output. The transition-based system for MD selects arcs for MD one at a time. It decodes the lattice using beam-search, and keeps the K-best paths at each step, scored according to morpheme-level and token-level features, weighted via structured-perceptron learning.
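The beam-search decoding described above can be sketched in a few lines of Python. This is a simplified illustration, not YAP's implementation: the `beam_search_md` function, the tuple layout of the arcs, and the hand-set `WEIGHTS` table (standing in for weights that would be learned by the structured perceptron) are all assumptions made for the example.

```python
from collections import defaultdict


def beam_search_md(arcs, start, end, score_arc, k=2):
    """Keep the k best partial lattice paths at every step.

    `arcs` are (start, end, form, pos) tuples; the lattice is assumed acyclic.
    Returns a list of (score, node, path) triples, best first.
    """
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)

    beam = [(0.0, start, [])]
    while any(node != end for _, node, _ in beam):
        candidates = []
        for score, node, path in beam:
            if node == end:
                # A finished hypothesis is carried over unchanged.
                candidates.append((score, node, path))
                continue
            for arc in outgoing[node]:
                new_score = score + score_arc(path, arc)
                candidates.append((new_score, arc[1], path + [arc]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beam = candidates[:k]
    return beam


# Hypothetical weights over (previous POS, current POS) pairs.
WEIGHTS = {
    ("<s>", "DEF"): 0.5,
    ("<s>", "VB"): 1.0,
    ("DEF", "NN"): 2.0,
    ("DEF", "NNP"): 0.2,
}


def score_arc(path, arc):
    """Score an arc given the path so far, using the toy weight table."""
    prev_pos = path[-1][3] if path else "<s>"
    return WEIGHTS.get((prev_pos, arc[3]), 0.0)


arcs = [(0, 1, "h", "DEF"), (0, 2, "hbn", "VB"),
        (1, 2, "bn", "NN"), (1, 2, "bn", "NNP")]
best = beam_search_md(arcs, 0, 2, score_arc, k=2)
print([arc[3] for arc in best[0][2]], best[0][0])
```

Note that the locally best first step (the single-arc verb reading) is overtaken once the second arc is scored, which is why more than one hypothesis is kept in the beam.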
The DEP component is a re-implementation of the Zhang and Nivre (2011) dependency parser for English, adapted for Hebrew. We assume an Arc-Eager transition system and beam-search decoding. Feature weights are learned via the structured perceptron. We employ a carefully designed feature set that reflects linguistic properties of Hebrew such as its rich morphological paradigms, flexible word order, agreement, etc. This provides SOTA results on Hebrew dependency parsing, albeit in an oracle (i.e., gold morphology) scenario.
Since both MD and DEP realize the same formal framework and inherit from the same computational machinery, we can easily unify them and treat the morpho-syntactic task as a single objective. The transition systems are combined, and the beam-search decoder interleaves morphological and syntactic decisions. (For a complete formal exposition of the algorithm we refer the reader to More et al. (2019).) Now morphological decisions may be affected by syntactic context, and vice versa.
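For readers unfamiliar with the Arc-Eager system, the sketch below applies a given sequence of its four standard transitions (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE) and returns the resulting heads and labels. It is a generic textbook formulation rather than YAP's code; the `arc_eager` function is hypothetical, and the example tree over English glosses is only an illustration of the label inventory in the appendix.

```python
def arc_eager(n_words, transitions):
    """Apply Arc-Eager transitions to words 1..n_words (0 is the root).

    `transitions` is a list of (action, label) pairs; returns (heads, labels),
    two dicts keyed by dependent index.
    """
    stack, buffer = [0], list(range(1, n_words + 1))
    heads, labels = {}, {}
    for action, label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # Top of stack becomes a dependent of the first buffer item.
            dep = stack.pop()
            if dep == 0 or dep in heads:
                raise ValueError("LEFT-ARC precondition violated")
            heads[dep], labels[dep] = buffer[0], label
        elif action == "RIGHT-ARC":
            # First buffer item becomes a dependent of the top of stack.
            dep = buffer.pop(0)
            heads[dep], labels[dep] = stack[-1], label
            stack.append(dep)
        elif action == "REDUCE":
            if stack[-1] not in heads:
                raise ValueError("REDUCE precondition violated")
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action}")
    return heads, labels


# Glosses for six segments: 1 the, 2 boy, 3 was-lying, 4 in, 5 the, 6 shade.
transitions = [
    ("SHIFT", None), ("LEFT-ARC", "def"),
    ("SHIFT", None), ("LEFT-ARC", "subj"),
    ("RIGHT-ARC", "ROOT"), ("RIGHT-ARC", "prepmod"),
    ("SHIFT", None), ("LEFT-ARC", "def"),
    ("RIGHT-ARC", "pobj"),
]
heads, labels = arc_eager(6, transitions)
print(heads)
```

In a real parser the transition sequence is not given but predicted, with each candidate action scored by the learned feature weights.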
The architecture is depicted in Figure 2. In More et al. (2019) we compared the performance of the joint system to our own pipeline system and to other systems available for Hebrew morphological and syntactic parsing, and showed significant improvements of YAP’s joint model over all competing systems.
4 The Annotation Scheme
We deliver automatic morpho-syntactic annotation of Hebrew texts based on the scheme of the SPMRL Hebrew dependency treebank. (The detailed annotation scheme is provided, with examples, in the supplementary material along with the screencast.) The SPMRL Hebrew scheme employs the labels of Sima’an et al. (2001) for morphology and POS tags, and the Unified-SD scheme of Tsarfaty (2013) for the labeled dependencies. (With an eye to future comparability, we further developed a conversion algorithm to convert the dependency tree from Unified-SD to Universal Dependencies (UD), https://universaldependencies.org/.) Specifically, we deliver the following annotation layers:
Morphological Segmentation
The most basic form of analysis of Hebrew texts is the segmentation of raw tokens into multiple meaning-bearing units that we call morphemes. (In UD they are called words. In Hebrew NLP they are called segments. We use morphemes or segments herein.)
Due to orthographic and phonological processes, some morphemes do not appear explicitly in the surface form. Our segmentation recovers all morphemes, both overt and covert. For example, the token ‘\cjRLbbyt’ (in the house) is segmented as ‘\cjRLb’ + ‘\cjRLh’ + ‘\cjRLbyt’.
Part-of-Speech (POS) Tags
Each morphological segment is assigned a single Part-of-Speech tag category that indicates its syntactic role. The set of tags used by the system is based on the SPMRL scheme which in turn adopts the POS labels from Sima’an et al. (2001) (detailed in our appendix).
Morphological Features
Along with the POS category, we specify for each segment the properties that are signalled by inflectional morphology. The scheme encodes the following properties: Number [S (Singular) / P (Plural) / D (Dual)], Gender [F (Female) / M (Male) / F,M (both)], Person [1 / 2 / 3 / A (All)] and Tense [Past, Present, Future, Imperative, Infinitive]. The person value A is used in cases where all analyses are valid, such as in the Beinoni form ‘\cjRL’wklt’ (I/you/she eat.singular.feminine). Present-tense verbs and participles are tagged ‘Beinoni’.
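As a small illustration of how such a feature bundle can be consumed downstream, the sketch below parses a FEATS string in the comma-separated `key=value` style shown in the lattice table (e.g. `gen=M,num=S,per=2,tense=IMPERATIVE`) into a dictionary. The `parse_feats` helper is hypothetical, and the way a two-valued gender such as F,M is serialized in actual output is an assumption here.

```python
def parse_feats(feats):
    """Parse 'gen=M,num=S,...' into a dict; '_' means no features."""
    result = {}
    if feats == "_":
        return result
    key = None
    for piece in feats.split(","):
        if "=" in piece:
            key, value = piece.split("=", 1)
            result[key] = value
        elif key is not None:
            # Assumed handling of multi-valued features such as gen=F,M.
            result[key] += "," + piece
    return result


print(parse_feats("gen=M,num=S,per=2,tense=IMPERATIVE"))
print(parse_feats("_"))
```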
Lemmas
Each segment is also assigned a lemma, i.e., the canonical representation of its core (uninflected) meaning. (Note that due to high morphological fusion in Hebrew, simple surface-based stemming will not suffice.) For Hebrew nouns and adjectives, the lemma is chosen to be the masculine singular form. For verbs, the lemma is the masculine singular third-person form in the past tense.
Dependency Tree
The dependency tree is defined over all morphological segments and an artificial root node. It consists of a set of labeled binary relations that indicate the bi-lexical dependencies between segments.
Note that the SPMRL dependency scheme, as opposed to UD, always selects functional heads, rather than lexical heads. The dependency labeling is based on the scheme from Tsarfaty (2013), repeated in the appendix.
Lattices
As explained in Section 3 above, a word can be segmented into morphemes in multiple ways, which are constrained by a broad-coverage lexicon. In addition to the parsed output, we make available for each input sentence its sentence lattice, i.e., the set of all possible segmentations for the sentence, along with all possible morphosyntactic analyses for each arc.
5 Technical Details and Forms of Use
YAP is implemented in the Go language (https://golang.org/). It requires 6GB of RAM to run, and employs a simple 3-step installation, given in the supplementary material in the appendix. The input to the system is a tokenized sentence, with tokens appearing one per line, and a line break after every sentence (we assume the tokenization convention of MILA; Itai and Wintner, 2008). The output is a dependency tree (where each node in the tree is a lattice arc) provided in the CoNLL-X format (Buchholz and Marsi, 2006). YAP is trained on the Hebrew section of the SPMRL shared task. It also makes use of the broad-coverage lexicon of Itai and Wintner (2008) for finding all potential lattice paths. For out-of-vocabulary (OOV) items, we employ a simple heuristic that suggests the 10 most likely analyses of rare tokens observed during training.
Simple Use: Command Line
From the command line, one can process one input file at a time, with a single sentence or more. The input file must be formatted with a single token per line, and an empty line denoting the end of every sentence.
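The input format described above is easy to produce programmatically. The following minimal sketch turns a list of already-tokenized sentences into the expected text, with one token per line and an empty line after every sentence. The `format_yap_input` name and the placeholder tokens are illustrative assumptions.

```python
def format_yap_input(sentences):
    """Render tokenized sentences as one token per line.

    `sentences` is a list of token lists. Each sentence is followed by an
    empty line, so the output also ends with an empty line.
    """
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)


sentences = [["tok1", "tok2", "tok3"], ["tok4", "tok5"]]
text = format_yap_input(sentences)
print(text, end="")
```

When saving the result to disk, writing it as UTF-8 with plain `\n` line endings (for instance `open(path, "w", encoding="utf-8", newline="\n")`) avoids platform-specific newline translation.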
Processing a file is done in 2 steps. First, run Morphological Analysis (./yap hebma) to generate a sentence lattice containing all possible morphological breakdowns of each token. YAP will save the lattice to the file specified via the -out flag.
Now you can run joint Morphological Disambiguation and Dependency Parsing ./yap joint to jointly predict the best lattice path and corresponding dependency tree. The input to this command is the output file generated in the previous step, and there are 3 output files: one containing word segments, one containing the disambiguated lattice path, and one containing the complete dependency tree in CoNLL-X format.
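Once the dependency output is available, it can be loaded with a few lines of Python. The sketch below reads tab-separated CoNLL-X text using the column layout given in the file-format table of the appendix (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The `Node` type, the `read_conll` function and the sample rows with English placeholder forms are assumptions for illustration, not actual YAP output.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """The first eight CoNLL-X columns; PHEAD and PDEPREL are ignored."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str


def read_conll(text):
    """Split CoNLL-X text into sentences, each a list of Node records."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Node(int(cols[0]), cols[1], cols[2], cols[3],
                            cols[4], cols[5], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "1\tthe\tthe\tDEF\tDEF\t_\t2\tdef\t_\t_\n"
    "2\tboy\tboy\tNN\tNN\tgen=M,num=S\t0\tROOT\t_\t_\n"
    "\n"
)
trees = read_conll(sample)
for node in trees[0]:
    print(node.id, node.form, node.postag, node.head, node.deprel)
```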
Advanced Use: RESTful API
YAP can run as a RESTful server that accepts parse requests. To do this, simply start the server, listening on localhost port 8000. Now you can call the joint endpoint with a JSON object containing the list of tokens to process in the HTTP data payload. The response is a JSON object containing the three output levels (MA, MD and Dep). You can use jq and sed (or any other JSON and line-processing tools) to format the (tab-separated value) responses and reassemble the output. Check our appendix for an illustration.
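As an alternative to jq and sed, the same round trip can be sketched in Python with the standard library. Several details below are assumptions that are not stated in this paper and should be checked against the YAP README before use: the endpoint path `/yap/heb/joint`, the use of a GET request with a JSON body, and the `text` key carrying the space-joined tokens. The response handling is deliberately generic: every string-valued field of the JSON object is treated as a tab-separated table, so no particular key names are assumed.

```python
import json
import urllib.request

# Assumed endpoint path; only "localhost port 8000" is given in the text.
YAP_JOINT_URL = "http://localhost:8000/yap/heb/joint"


def build_payload(tokens):
    """Encode the request body (the 'text' key is an assumption)."""
    body = {"text": " ".join(tokens)}
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def tsv_rows(level):
    """Split one tab-separated output level into rows of columns."""
    return [line.split("\t") for line in level.splitlines() if line.strip()]


def parse_response(body):
    """Turn every string-valued field of the JSON response into TSV rows."""
    data = json.loads(body)
    return {key: tsv_rows(value)
            for key, value in data.items() if isinstance(value, str)}


def call_joint(tokens, url=YAP_JOINT_URL):
    """Send the tokens to a running YAP server and parse its response."""
    request = urllib.request.Request(
        url,
        data=build_payload(tokens),
        headers={"Content-Type": "application/json"},
        method="GET",  # assumed; adjust if the server expects another verb
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read().decode("utf-8"))
```

With a server running, `call_joint(["tok1", "tok2"])` would return a dictionary mapping each output level to its rows; the parsing helpers can be tested without a server.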
Educational Use: The Online Demo
In 2018 we decided to create an online demo of the system, for educational purposes: (i) to expose NLP/AI researchers to the NLP capabilities available for Hebrew; (ii) to educate non-CS scientists and engineers who work with Hebrew data (e.g., in the digital humanities) on text annotations that can potentially be useful for their applications; and (iii) to launch outreach activities where we teach the local community (e.g., school kids) what NLP is (see, e.g., https://www.youtube.com/watch?v=TFwQeoKpznA&feature=youtu.be).
To use the demo, simply go to onlp.openu.ac.il and type a Hebrew sentence in the textbox. The demo is built with the Django and Bootstrap web frameworks. It sends the user’s Hebrew text input to the Onlp server, which returns a CoNLL-X formatted parse along with the complete sentence lattice. Pre-processing includes pre-morphological tokenization of the input, where punctuation is separated from the tokens. Double quotation marks are separated from the word unless they appear before the last character of the word, to avoid over-segmentation of acronyms. (Acronyms in Hebrew are written with a quotation mark before the last letter, e.g. ‘\cjRLb”\cjRL’rh’ (USA).) The tokenized sequence is then passed to the Onlp server. The CoNLL-X output is then processed into the following layers: the FORM column is concatenated and presented as "Segmented Text", and the POS, LEMMA, FEATS and DEPS are presented in separate accordion tabs.
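The quotation-mark rule in this pre-processing step can be sketched as follows. This is an independent illustration of the rule as stated, not the demo's actual code: the `pre_tokenize` and `split_word` functions and the punctuation set are assumptions, and hyphens are deliberately left untouched.

```python
PUNCT = '.,:;!?()"'


def split_word(word):
    """Separate punctuation from one whitespace-delimited word.

    A double quote directly before the last character of the word is kept,
    since that is how Hebrew acronyms are written.
    """
    lead, trail = [], []
    while word and word[0] in PUNCT:
        lead.append(word[0])
        word = word[1:]
    while word and word[-1] in PUNCT:
        trail.insert(0, word[-1])
        word = word[:-1]

    core, current = [], ""
    for i, ch in enumerate(word):
        acronym_quote = ch == '"' and i == len(word) - 2
        if ch in PUNCT and not acronym_quote:
            if current:
                core.append(current)
                current = ""
            core.append(ch)
        else:
            current += ch
    if current:
        core.append(current)
    return lead + core + trail


def pre_tokenize(text):
    """Whitespace-split the text and separate punctuation from each word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word))
    return tokens


print(pre_tokenize('הוא גר בארה"ב.'))
print(pre_tokenize('"שלום", אמר'))
```

Peeling leading and trailing punctuation first means that an acronym followed by a period or comma is still recognized as an acronym.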
Furthermore, the demo presents the sentence lattice which is the input to the joint parser. This is useful for debugging, and for analyzing lexical-coverage in out-of-domain scenarios.
Expert Use: Out-of-Domain Scenarios
A bottleneck for the system in out-of-domain parsing scenarios is the coverage of the lexicon. We rely on a general-purpose lexicon containing over 500K entries. OOV words are treated via heuristics we designed, which are suitable for the general case only. However, accurately identifying vocabulary items may be critical when applying the parser to new domains with domain-specific information (medical, financial, political, etc.). Fortunately, we can extend the system with a domain-specific lexicon, thus extending the MA coverage. Due to joint inference, the availability of a better-suited lexical analysis triggers better lexico-syntactic decisions on the whole. (We discuss how exactly this is executed in the appendix.)
6 Related and Future Work
Hebrew NLP in general and Hebrew parsing in particular are known to be challenging, due to interesting linguistic properties, the scarcity of annotated data, and the small research community around them. As a result, Hebrew has been seriously under-studied in NLP. During the early 2000s, the MILA knowledge center was established, where two of the main Hebrew resources for NLP were developed: the Hebrew treebank (Sima’an et al., 2001) and the Hebrew lexicon (Itai and Wintner, 2008).
Morphological taggers for Hebrew using local linear context have been trained on these data and were made available for free use (Adler and Elhadad, 2006; Bar-haim et al., 2008). However, their performance was not on a par with parallel tools for English and was thus insufficient for commercial use. Hebrew dependency parsing was initially provided by Goldberg and Elhadad (2009), but the parser provides only unlabeled dependencies, and the pipeline relied on Adler’s morphological tagger. This left the automatic dependency trees inaccurate and unsatisfying. Joint morpho-syntactic models for constituency-based parsing (Tsarfaty, 2010) showed good performance on benchmark data, but their code was never released for open use.
With the development of the UD treebanks collection, general frameworks such as UDPipe (Straka et al., 2016) and CoreNLP (Manning et al., 2014) have been trained on the Hebrew UD treebank, and the resulting models were made available. However, the performance of these models is still far from satisfactory. As we also demonstrate in our screencast (https://www.youtube.com/watch?v=H6pvh1x20FQ), these systems make very basic mistakes, even on the simplest sentences. We conjecture that this is due to their inherent pipeline assumption: initial layers of processing make many mistakes due to the extreme morphological ambiguity, and later layers cannot recover. Notably, neural network models utilizing word embeddings (e.g., UDPipe) also still lag behind.
Table 1 shows the task coverage of existing tools and toolkits for NLP in Hebrew, academic as well as private initiatives (NITE, Hebrew-NLP). The task coverage of the Onlp suite we present is on a par with international standards (UDPipe, CoreNLP), and its level of performance was shown to exceed all existing models (More et al., 2019). We are currently working towards Named-Entity Recognition as well as Open Information Extraction, to be added to Onlp in the near future.
7 Conclusion
This paper presents Onlp, a complete language-processing framework for automatic annotation of Modern Hebrew texts. The framework covers morphological segmentation, POS tags, lemmas and features, and dependency parsing, all predicted jointly. The system is easy to install and to use, and we support multiple forms of usage fitting user personas with different needs. We hope the availability of an open-source, accurate, and easy-to-use system for NLP in Hebrew will benefit the local NLP open-source community and greatly advance Hebrew language technology research and development, in academia and in the industry.
Acknowledgements
We thank the NLPH community, in particular Shay Palachi, Amit Shkolnick and Yuval Feinstein, for much discussion and insightful comments. We further thank Avi Bivas (Innovation Authority) and Milo Avisar for promoting NLP initiatives in Israel. This research is supported by an ISF grant (1739/26) and an ERC Starting grant (677352), for which we are grateful.
Supplementary Material For EMNLP Demo Paper
These supplementary materials document the absolute essentials for starting to use the system: installation, annotation scheme documentation, forms of use, and enhancements for out-of-domain scenarios.
Appendix A Resources
- YAP Github:
https://github.com/OnlpLab/yap
- YAP Demo - Website:
- YAP Demo - Screencast: (Youtube)
https://www.youtube.com/watch?v=H6pvh1x20FQ
- YAP Python-Wrapper:
https://github.com/amit-shkolnik/YAP-Wrapper
- SPMRL-to-UD Conversion:
https://github.com/OnlpLab/Hebrew_UD
- ONLP Lab Website:
Appendix B Screen-Cast
Check out our screen-cast online demo at: https://www.youtube.com/watch?v=H6pvh1x20FQ
Appendix C Morphological Ambiguity: Lattices
Table 2 shows a sentence lattice capturing the high ambiguity of Hebrew morphological analysis. For a simple 3-token input sentence, 22 possible arcs present valid analyses of the various tokens. A single contiguous path through the lattice needs to be selected for the sentence to be further processed by syntactic parsers or downstream applications.
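To make the notion of "a single path through the lattice" concrete, the following is a minimal sketch, not YAP's actual disambiguation procedure (which scores analyses jointly with syntax). The `Arc` fields mirror the From, To, Form, Part of Speech and Token Number columns of the lattice table; the class name, the function `enumerate_paths`, and the toy arcs with placeholder forms are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """One lattice arc: a candidate morpheme spanning two lattice nodes."""
    start: int  # "From" node
    end: int    # "To" node
    form: str
    pos: str
    token: int  # number of the raw token this arc belongs to


def enumerate_paths(arcs, source, sink):
    """Return every contiguous arc sequence leading from source to sink.

    Assumes the lattice is acyclic (every arc moves forward).
    """
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.start].append(arc)

    paths = []

    def walk(node, prefix):
        if node == sink:
            paths.append(list(prefix))
            return
        for arc in outgoing[node]:
            prefix.append(arc)
            walk(arc.end, prefix)
            prefix.pop()

    walk(source, [])
    return paths


# Hypothetical toy lattice with placeholder forms: the first token is either
# one noun ("AB") or a preposition ("A") followed by a noun ("B").
TOY_LATTICE = [
    Arc(0, 2, "AB", "NN", 1),
    Arc(0, 1, "A", "PREPOSITION", 1),
    Arc(1, 2, "B", "NN", 1),
    Arc(2, 3, "C", "VB", 2),
]

if __name__ == "__main__":
    for path in enumerate_paths(TOY_LATTICE, 0, 3):
        print(" ".join(f"{arc.form}/{arc.pos}" for arc in path))
```

Exhaustive enumeration is only practical for tiny lattices; it is used here to show that each candidate analysis of the sentence corresponds to exactly one source-to-sink path.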
Appendix D Annotation Layers
The annotation scheme provided by Onlp corresponds to the Hebrew section of the SPMRL shared task 2013-2014 (http://www.spmrl.org/spmrl2013-sharedtask.html). The part-of-speech tags we employ are provided, along with illustrative examples, in Table 3. The dependency labels are defined and illustrated in Table 5.
Appendix E The Online Demo
In Figure 3 we present a screen capture of the Morphological Segmentation, POS tags and Dependency Relations for two raw input sentences:
- ‘hbn /skb b.sl’: ‘the-boy was-lying in-the-shade’
- ‘hbn /snm b.sl’: ‘the-boy that-was-napping in-the-shade’
Both sentences were executed on our demo page. Note that the two raw sentences have very similar forms (in fact, they differ in only two characters), yet they end up forming very different syntactic structures, which the Onlp system annotates correctly.
Appendix F Forms of Use
Figures 4–6 present the usage patterns with the YAP parser, the core algorithm of the framework. In Figure 4 we present the 3-step installation, in Figure 5 we show a simple command-line use, and in Figure 6 we show how to use YAP as a service. As noted before, the input file must be formatted with a single token per line and an empty line denoting the end of a sentence. Crucially, the last line in the file must be empty to denote the end of the last sentence. A note for Windows users: YAP does not handle Windows-style text files that have BOM marks and CRLF newlines, so if you are running on Windows and YAP does not work, make sure the input has no BOM marks and no CRLF line endings.
YAP is written in Go in order to enable multi-threading, which means that it can be called from multiple threads in parallel. As of June 2019 there is also a Python wrapper, created by members of the Israeli open-source community (the credit goes to Amit Shkolnik of the 4girls initiative; further details can be found at https://github.com/amit-shkolnik/YAP-Wrapper).
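The input constraints above (one token per line, an empty line after every sentence including the last, no BOM, no CRLF) are easy to get wrong, so here is a minimal sketch of how such a file could be prepared. The helper names `format_yap_input`, `write_yap_input` and `normalize_existing_file` are hypothetical and are not part of YAP or its wrapper; the sample tokens are the transliterated demo sentence from Appendix E.

```python
import os
import tempfile
from pathlib import Path


def format_yap_input(sentences):
    """Render tokenized sentences as one token per line, with an empty
    line after every sentence (including the last one)."""
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # empty line marks the end of the sentence
    return "\n".join(lines) + "\n"


def write_yap_input(sentences, path):
    """Write the formatted input as UTF-8 without BOM and with LF newlines."""
    # newline="\n" prevents translation to CRLF on Windows;
    # encoding "utf-8" (unlike "utf-8-sig") never emits a BOM.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(format_yap_input(sentences))


def normalize_existing_file(path):
    """Strip a leading BOM, convert CRLF to LF, and make sure the file
    ends with an empty line."""
    text = Path(path).read_bytes().decode("utf-8-sig")  # drops a BOM if present
    text = text.replace("\r\n", "\n")
    if not text.endswith("\n\n"):
        text = text.rstrip("\n") + "\n\n"
    Path(path).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "input.txt")
        write_yap_input([["hbn", "/skb", "b.sl"]], target)
        print(repr(Path(target).read_bytes()))
```

Writing in binary-safe fashion (explicit encoding and newline settings) keeps the output identical across operating systems, which is the point of the Windows note above.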
Appendix G Out-of-Domain Scenarios
When observing errors in a new domain, the first thing to check is whether they are due to lexical gaps, i.e., whether they stem from a lack of coverage in the lexicon. The availability of the sentence lattice output is of great value in this respect. By reviewing the lattice, it is possible to see whether the lexicon contains the correct morphological analysis for the input token at all. If the correct analysis is not in the lattice, it is easy to add the missing analyses by editing the lexicon (the lexicon file is located at data/bgulex/bgulex.utf8.hr).
Each line in the lexicon file contains a token followed by a list of one or more possible morphological analyses of that token. An analysis is a tuple made of 3 parts, prefix:host:suffix, followed by the host lemma. Each tuple member contains the part-of-speech tag and morphological features of the corresponding element; prefix and suffix may be empty. For example:
''bd :VB-MF-S-1-FUTURE-NIFAL: n'bd :VB-MF-S-1-FUTURE-PIEL: 'ybd
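The line format described above can be read mechanically. The sketch below parses the example line (written in ASCII transliteration) under two assumptions that go beyond the text: that fields are separated by whitespace, and that each analysis tuple is immediately followed by exactly one lemma field. The names `Analysis` and `parse_lexicon_line` are hypothetical and are not part of YAP.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One prefix:host:suffix tuple together with the host lemma."""
    prefix: str
    host: str
    suffix: str
    lemma: str


def parse_lexicon_line(line):
    """Split a lexicon line into its token and list of analyses.

    Assumes whitespace-separated fields: the token first, then
    alternating 'prefix:host:suffix' and lemma fields.
    """
    fields = line.split()
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"expected a token plus (analysis, lemma) pairs: {line!r}")
    token, rest = fields[0], fields[1:]
    analyses = []
    for tags, lemma in zip(rest[0::2], rest[1::2]):
        parts = tags.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected prefix:host:suffix, got {tags!r}")
        analyses.append(Analysis(parts[0], parts[1], parts[2], lemma))
    return token, analyses


if __name__ == "__main__":
    example = "''bd :VB-MF-S-1-FUTURE-NIFAL: n'bd :VB-MF-S-1-FUTURE-PIEL: 'ybd"
    token, analyses = parse_lexicon_line(example)
    print(token, len(analyses))
    for analysis in analyses:
        print(analysis)
```

In the example both analyses have an empty prefix and an empty suffix, which is why each tuple starts and ends with a colon.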
An example use case could arise when processing medical-domain texts related to cancer, in which the word ‘lymph’ (the Hebrew word for lymph, shown here in transliteration) appears in the text but is missing from the lexicon. In this case, the parser errs by identifying the first ‘l’ as the preposition “to”, followed by a proper noun.
To remedy this, we can update the lexicon by adding the following line:
lymph :NN-F-S: lymph
This means that the token lymph is a common noun with feminine gender and singular number, followed by the lemma, and that it is unambiguous (i.e., only one analysis is available). Note that after updating the lexicon you need to restart YAP (if it is running as a RESTful server) for the lexical changes to apply.
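For repeated lexicon updates, the edit can be scripted. The following is a minimal sketch under the assumption that appending a new line at the end of the lexicon file is acceptable and that entries are whitespace-separated; the helper `add_lexicon_entry` is hypothetical and not part of YAP. It only handles tokens that are missing altogether: if the token already has a line, it leaves the file untouched rather than merging analyses.

```python
import os
import tempfile
from pathlib import Path


def add_lexicon_entry(lexicon_path, token, host_tags, lemma):
    """Append 'token :host_tags: lemma' unless the token already has a line.

    Returns True if a line was appended, False if the token was present.
    """
    path = Path(lexicon_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == token:
            return False  # token already covered; do not duplicate it
    entry = f"{token} :{host_tags}: {lemma}\n"
    if text and not text.endswith("\n"):
        entry = "\n" + entry  # keep the new entry on its own line
    with open(path, "a", encoding="utf-8", newline="\n") as handle:
        handle.write(entry)
    return True


if __name__ == "__main__":
    # Demonstrated on a temporary file rather than the real lexicon.
    with tempfile.TemporaryDirectory() as workdir:
        lexicon = os.path.join(workdir, "lexicon.txt")
        print(add_lexicon_entry(lexicon, "lymph", "NN-F-S", "lymph"))
        print(Path(lexicon).read_text(encoding="utf-8"), end="")
```

As the text notes, a YAP instance running as a server still has to be restarted before the new entry takes effect.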
Now that lymph is no longer out-of-vocabulary (OOV), sentences containing this token will be given a more accurate lattice and, as a result, will be analyzed with a global syntactic structure that accords with the correct analysis. We suggested these lexicon edits to our users working in specific industry domains (medical, social, political), and they attested to significant improvements when running on those domains (Yuval Feinstein, NLP Consultant, p.c.).
References
- Meni Adler and Michael Elhadad. 2006. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In ACL. The Association for Computer Linguistics.
- Roy Bar-haim, Khalil Sima’an, and Yoad Winter. 2008. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering, 14(2):223–251.
- Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.
- Yoav Goldberg and Michael Elhadad. 2009. Hebrew dependency parsing: Initial results. In Proceedings of the 11th International Conference on Parsing Technologies, IWPT ’09, pages 129–133.
- Yoav Goldberg and Michael Elhadad. 2011. Joint Hebrew segmentation and parsing using a PCFGLA lattice parser. In Proceedings of ACL.
- Yoav Goldberg and Reut Tsarfaty. 2008. A single framework for joint morphological segmentation and syntactic parsing. In Proceedings of ACL.
- Spence Green and Christopher D. Manning. 2010. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of COLING.
- Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
