What's Wrong with Hebrew NLP? And How to Make it Right

Reut Tsarfaty; Amit Seker; Shoval Sadde; Stav Klein

arXiv:1908.05453·cs.CL·August 16, 2019

TL;DR

This paper introduces Onlp, a joint morpho-syntactic parser for Modern Hebrew that improves accuracy by reducing error propagation, addressing challenges faced by NLP tools in morphologically-rich languages.

Contribution

The paper presents a novel joint inference framework for Hebrew NLP that enhances accuracy and provides rich output, filling a gap in tools for morphologically-rich languages.

Findings

01

Onlp achieves high accuracy in Hebrew morphological and syntactic parsing.

02

Joint inference reduces error propagation compared to pipeline approaches.

03

The tool supports diverse academic and commercial applications.

Tables

Table 1: Existing Coverage for Hebrew NLP Tasks

	Tok	MA	MD	POS	Lem	Feats	Deps	Joint
Tasks
MILA	✓	✓
NITE	✓	✓	✓
Hebrew-NLP		✓
Adler			✓	✓		✓
Goldberg							✓
Pipelines
UDPipe	✓	✓	✓	✓	✓	✓	✓
CoreNLP	✓	✓	✓	✓	✓	✓	✓
ONLP		✓	✓	✓	✓	✓	✓	✓

Table 2: The lattice representation for ‘ \cjRL hbn /snm b.sl‘ ‘The boy who slept in the shade’. Col 1-2: From/To - the start and end nodes of the morpheme. The numbers are with respect to the maximal length route. Col 3: Form - the surface form of the morphological segment. Col 3-4-5: Form/Lemma/Part of Speech - the same segment may belong to different entries in the lexicon. Each entry is given in a separate row, where the differences between the different meanings are surfaced in one (or more) of the Form/Lemma/Part of Speech columns. Col 6: Features - the morphological features of the segment (underscore if unavailable). Col 7: Token Number - represents the index of the raw (space-delimited) token in the input before segmentation.

From	To	Form	Lemma	Part of Speech	Features	Token Number
0	1	\cjRLh	\cjRLh	DEF	_	1
0	3	\cjRLh	\cjRLh	REL	_	1
0	5	\cjRLhbn	\cjRLhbyn	VB	gen=M,num=S,per=2,tense=IMPERATIVE	1
1	2	\cjRLb	\cjRLb	IN	_	1
1	5	\cjRLbn	\cjRLbn	NNP	gen=M,num=S	1
1	5	\cjRLbn	\cjRLbn	NNT	gen=M,num=S	1
1	5	\cjRLbn	\cjRLbn	NN	gen=M,num=S	1
2	5	\cjRLhn	\cjRLhn	S_PRN	gen=F,num=P,per=3	1
3	4	\cjRLb	\cjRLb	IN	_	1
3	5	\cjRLbn	\cjRLbn	NNP	gen=M,num=S	1
3	5	\cjRLbn	\cjRLbn	NNT	gen=M,num=S	1
3	5	\cjRLbn	\cjRLbn	NN	gen=M,num=S	1
4	5	\cjRLhn	\cjRLhn	S_PRN	gen=F,num=P,per=3	1
5	6	\cjRL/s	\cjRL/s	REL	_	2
5	7	\cjRL/snm	\cjRL/sn	NN	gen=F,num=S,suf_gen=M,suf_num=P,suf_per=3	2
6	7	\cjRLnm	\cjRLnm	VB	gen=M,num=S,per=A,tense=BEINONI	2
6	7	\cjRLnm	\cjRLnm	BNT	gen=M,num=S,per=A	2
6	7	\cjRLnm	\cjRLnm	BN	gen=M,num=S,per=A	2
6	7	\cjRLnm	\cjRLnm	VB	gen=M,num=S,per=3,tense=PAST	2
7	8	\cjRLb	\cjRLb	PREPOSITION	_	3
7	10	\cjRLb.sl	\cjRLb.sl	NN	gen=M,num=S	3
7	10	\cjRLb.sl	\cjRLb.sl	NNT	gen=M,num=S	3
8	9	\cjRLh	\cjRLh	DEF	_	3
8	10	\cjRL.sl	\cjRL.sl	NN	gen=M,num=S	3
8	10	\cjRL.sl	\cjRL.sl	NNT	gen=M,num=S	3
9	10	\cjRL.sl	\cjRL.sl	NNT	gen=M,num=S	3
9	10	\cjRL.sl	\cjRL.sl	NN	gen=M,num=S	3

Table 3: The Part-of-Speech Tags Provided by Onlp

POS	Definition	Example
ADVERB	The word \cjRLk*: before numerals	\cjRLk*:mylywn
AT	The accusative marker \cjRL’t which is a separate word in Hebrew	\cjRL’t hklb
BN	Participle (Beinoni)	\cjRLmgy‘ym
BNT	Participle in construct state form	\cjRLmqymy h‘ytwn
CC	Conjunction	\cjRL’l’
REL	Relative clause marker	\cjRL/s:
CD	Numeral	\cjRLm’wt
CDT	Numeral in construct state	\cjRL’lpy
CONJ	Coordinating conjunction \cjRLw	\cjRLw:
COP	Copula	\cjRLhyh
DEF	A special tag assigned to the definite marker \cjRLh which appears with nouns, adjectives and numerals	\cjRLh
DTT	Determiner	\cjRLkl
DUMMY_AT	Accusative marker \cjRL’t when used with a pronominal suffix	\cjRL’wtw
EX	The existential markers \cjRLy/s or \cjRL’yn	\cjRLy/s
IN	Preposition	\cjRL‘d
INTJ	Interjection	\cjRLn’
JJ	Adjective	\cjRLzrym
JJT	Adjective in construct state	\cjRLypy np/s
MD	Modal predicates	\cjRL.sryK
NN	Noun	\cjRL.hbr
NN_S_PP	Noun with a pronominal suffix	\cjRLpw‘lyhM
NNP	Proper Noun	\cjRLn‘my
NNT	Construct state noun	\cjRLh‘sqt
P	Prefix written as a separate word	\cjRLblty
POS	Possessive preposition \cjRL/sl	\cjRL/sl
PREPOSITION	Inseparable preposition	\cjRLb*:
PRP	Personal Pronoun	\cjRLhy’
S_PRP	Reflexive pronoun	\cjRL‘.smy
QW	Question word	\cjRLky.sd
S_PRN	Personal pronoun attached to a preposition as a pronominal suffix	\cjRL’wtnw
TEMP	Subordinating conjunction introducing time clauses	\cjRLk*:/s:
VB	Verb	\cjRL’mrh
yyCLN	Colon	:
yyCM	Comma	,
yyDASH	Hyphen or dash	-
yyDOT	Period	.
yyELPS	Ellipsis	…
yyEXCL	Exclamation mark	!
yyLRB	Left Parenthesis	(
yyQM	Question Mark	?
yyQUOT	Quotation Mark	” ”
yyRRB	Right Parenthesis	)
yySCLN	Semicolon	;

Table 4: The Dependency Labels Provided by Onlp

Dependency	Definition	Relation Instance	Example Sentence
num	numerical modifier	num (\cjRL’n/syM, \cjRL‘/srwt)	\cjRL‘/srwt ’n/syM mgy‘ym mt’ylnd
subj	subject	subj (\cjRLhtbrrh, \cjRLhtwp‘h)	\cjRLhtwp‘h htbrrh ’tmwl
ROOT	root	ROOT ( root ,\cjRL.t‘nh)	\cjRLhy’ .t‘nh kK
prepmod	prepositional modifier	prepmod (\cjRLm: , \cjRL.sd)	\cjRLm.sd ’.hd
pobj	object of a preposition	pobj (\cjRLl: , \cjRLn.sygyM)	\cjRLhy’ tpnh ln.sygym
comp	complement	comp (\cjRLbkh, \cjRLk’/sr)	\cjRLhyld bkh k’/sr lq.hw lw ’t h.s‘.sw‘
conj	conjunct	conj (\cjRLw: , \cjRLr‘myM )	\cjRLr‘myM wbrqyM
punct	punctuation	punct (\cjRLn/sm‘h , : )	!\cjRLttkwpP :\cjRLn/sm‘h qry’h
advcl	adverbial clause	advcl (\cjRL’M, \cjRLyw/sg)	\cjRLhM y/sbtw ’M l’ yw/sg hskM
advmod	adverbial modifier	advmod (\cjRLytqblw, \cjRLl’ltr)	\cjRLkwlM ytqblw l’ltr
obj	object	obj (\cjRLt‘/sh, \cjRLml.hmh)	\cjRLbc.hkmh t‘/sh lk ml.hmh
amod	adjectival modifier	amod (\cjRLhby.tw.h,\cjRLhl’wmy)	\cjRLhby.tw.h hl’wmy b/sbyth
det	determiner	det (\cjRLhyldyM,\cjRLkl)	\cjRL’ny rw’h ’t kl hyldyM
def	definite marker	def (\cjRL’mbwlns,\cjRLh)	\cjRLh’mbwlns ht.hyl lnsw‘
gobj	genitive object	gobj (\cjRLpr/sy,\cjRLm/s.trh)	\cjRLpr/sy m/s.trh y.s’w
possmod	possession modifier	possmod (\cjRLw‘dt,\cjRL/sl)	\cjRLw‘dt hkspyM /sl hknst
rcmod	relative clause modifier	rcmod (\cjRLhw‘dh,\cjRL/s:)	\cjRLhw‘dh /sdnh bnw/s’
relcomp	relative complement	relcomp (\cjRL/s:,\cjRLdnh)	\cjRLhw‘dh /sdnh bnw/s’
appos	apposition / parenthetical	appos (\cjRLk*”\cjRL.h,\cjRLmpM)	(\cjRLmpM ) \cjRLy’yr .sbN \cjRLk*”\cjRL.h
nn	noun modifier	nn (\cjRLsN,\cjRLsymwN)	\cjRLmnzr sN symwN
ccomp	complement clause with internal subject	ccomp (\cjRL/s:,\cjRL.hmwd)	\cjRL’mrty lk /s’th .hmwd
neg	negative modifier	neg (\cjRLtk‘s,\cjRLl’)	\cjRLhy’ l’ tk‘s
pcomp	complement clause of a preposition	pcomp (\cjRLkdy,\cjRLlh/sttP)	\cjRLhw’ .ts kdy lh/sttP bt.hrwt
xcomp	complement clause with external subject	xcomp (\cjRLr.sh,\cjRLlh‘lwt)	\cjRLhw’ r.sh lh‘lwt ’t h/skr
acc	accusative case	acc (\cjRLly.tpty,\cjRL’t)	\cjRLly.tpty ’t hklb
vmod	verb as modifier	vmod (\cjRLsykwy,\cjRLlhtqbl)	\cjRLy/s lw sykwy lhtqbl l’qdmyh
gen	genitive case	gen (\cjRLmktbh,\cjRL/sl)	\cjRLmktbh /sl mly pylypsbwrN
number	numerical modifier in digits	number(\cjRL.htymwt,84)	\cjRL.htymwt 84 \cjRLhw’ ’sP
mwe	multi-word expression	mwe (\cjRLmdy,\cjRL/snh)	\cjRLhw’ .ts mdy /snh
goeswith	tokens originally connected with a hyphen	goeswith (\cjRLmwnswn,\cjRLnwwh)	\cjRLmwnswn-\cjRL’n.hnw gryM bnwwh
cop	copular element	cop (\cjRLmqwM,\cjRLhy’)	\cjRL’ywbh hy’ mqwM l’ /sgrty
cc	introducing conjunction	cc (\cjRL’mr,\cjRLhry)	\cjRLhry hw’ ’mr z’t qwdM
npred	noun as predicate	npred (\cjRLhyh,\cjRLqwmwnys.t)	\cjRLhw’ hyh qwmwnys.t
parataxis	side-by-side, interjection	parataxis (\cjRL’/sM,\cjRLnwld)	\cjRLhw’ nwld kkh ,\cjRLhw’ l’ ’/sM
npadvmod	noun phrase as adverbial modifier	npadvmod (\cjRLyhyh,\cjRLywM)	\cjRLywM ’.hd hw’ yhyh hn/sy’
apred	adjective as predicate	apred (\cjRLhyyty,\cjRLtmyM)	\cjRLmstbr /shyyty tmyM
vocative	explicitly addressing a dialogue participant	vocative(\cjRLh‘t,\cjRLrbwty)	\cjRLrbwty ,\cjRLzw h‘t ly/swN
aux	auxiliary verb or feature-bundle	aux (\cjRL/sqw‘h,\cjRLhyth)	\cjRLklklth hyyth /sqw‘h bmytwn
ppred	preposition as predicate	ppred (\cjRLyhyh,\cjRLb*:)	\cjRLm.hr hq.tyP yhyh b‘y.swmw
acomp	adjectival complement	acomp (\cjRLnr’h,\cjRLmyw.hd)	\cjRLhw’ nr’h myw.hd
qmark	question	qmark (\cjRLykwlyM,\cjRLh’M)	\cjRLh’M ’tM ykwlyM lhm/syk

Table 5: Column Definitions in .ma, .md and .conll files

	Morphological Analysis Lattice (.ma and .md files)
Column	Definition	Tag	Comment
col 1	Morpheme Start Index in the Lattice	FROM
col 2	Morpheme End Index in the Lattice	TO
col 3	Form of the Morpheme	FORM
col 4	Lemma of the Morpheme	LEMMA
col 5	Coarse Part of Speech Tag	CPOSTAG	underscore if unavailable
col 6	Fine Part of Speech Tag	POSTAG	CPOSTAG and POSTAG are identical in YAP
col 7	Morphological Features	FEATS	underscore if unavailable
col 8	Source Token Index	TOKEN
	CONLL File format (.conll)
col 1	Morpheme Index in the Sentence	ID
col 2	Form of the Morpheme	FORM
col 3	Lemma of the Morpheme	LEMMA	underscore if unavailable
col 4	Coarse Part of Speech Tag	CPOSTAG	underscore if unavailable
col 5	Fine Part of Speech Tag	POSTAG	CPOSTAG and POSTAG are identical in YAP
col 6	Morphological Features	FEATS	underscore if unavailable
col 7	Head Index Pointer	HEAD	note that the resulting structure is a tree
col 8	Dependency relation to the HEAD	DEPREL
col 9	Projective Head	PHEAD	ignore - unused by YAP
col 10	Dependency relation to the PHEAD	PDEPREL	ignore - unused by YAP
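
The column layouts in Table 5 map naturally onto simple record types. The following is a minimal Python sketch, assuming the files are tab-separated (as in the CoNLL-X convention) and carry exactly the columns listed above. The names `LatticeArc`, `ConllRow`, `parse_lattice_line` and `parse_conll_line` are hypothetical and not part of YAP.

```python
from typing import NamedTuple


class LatticeArc(NamedTuple):
    """One line of a .ma/.md lattice file (8 columns, per Table 5)."""
    start: int    # FROM
    end: int      # TO
    form: str     # FORM
    lemma: str    # LEMMA
    cpostag: str  # CPOSTAG
    postag: str   # POSTAG
    feats: str    # FEATS ("_" if unavailable)
    token: int    # TOKEN (index of the source token)


class ConllRow(NamedTuple):
    """One line of a .conll file (10 columns, per Table 5)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str    # unused by YAP
    pdeprel: str  # unused by YAP


def _fields(line: str, expected: int) -> list:
    # Assumption: fields are separated by single tab characters.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    return fields


def parse_lattice_line(line: str) -> LatticeArc:
    f = _fields(line, 8)
    return LatticeArc(int(f[0]), int(f[1]), f[2], f[3], f[4], f[5], f[6], int(f[7]))


def parse_conll_line(line: str) -> ConllRow:
    f = _fields(line, 10)
    return ConllRow(int(f[0]), f[1], f[2], f[3], f[4], f[5], int(f[6]), f[7], f[8], f[9])
```

Keeping the two formats as separate types makes it harder to confuse the lattice indices (FROM/TO) with the sentence-level morpheme ID used in the CoNLL file.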

Full text

What’s Wrong with Hebrew NLP?

And How to Make it Right

Reut Tsarfaty Amit Seker Shoval Sadde Stav Klein

Open University of Israel, University Road 1, Ra’anana, Israel

{reutts,shovalsa,amitse,stavkl}@openu.ac.il

Abstract

For languages with simple morphology, such as English, automatic annotation pipelines such as spaCy or Stanford’s CoreNLP successfully serve projects in academia and the industry. For many morphologically-rich languages (MRLs), similar pipelines show sub-optimal performance that limits their applicability for text analysis in research and the industry. The sub-optimal performance is mainly due to errors in early morphological disambiguation decisions, which cannot be recovered later in the pipeline, yielding incoherent annotations on the whole. In this paper we describe the design and use of the Onlp suite, a joint morpho-syntactic parsing framework for processing Modern Hebrew texts. The joint inference over morphology and syntax substantially limits error propagation, and leads to high accuracy. Onlp provides rich and expressive output which already serves diverse academic and commercial needs. Its accompanying online demo further serves educational activities, introducing Hebrew NLP intricacies to researchers and non-researchers alike.

1 Introduction

NLP pipelines for the automatic annotation of unstructured texts are at the core of language technology applications for Data Science, Text Analytics and Artificial Intelligence. For English, annotation pipelines such as spaCy (Honnibal and Montani, 2017) or Stanford’s CoreNLP (Manning et al., 2014) successfully deliver the ability to automatically annotate unstructured texts with their underlying linguistic structures, including Part-of-Speech (POS) tags, morphological features, dependency relations, named entities, and so on. These annotations serve research labs, non-profit organizations and commercial endeavors in their quest to make sense of the vast amount of unstructured data available to them.

Universal processing pipelines such as UDPipe (Straka et al., 2016) aim to serve a range of other languages, but unfortunately, their performance on many morphologically rich languages (MRLs) (Tsarfaty et al., 2010), and in particular on Semitic languages, is not on a par with their performance on English. This, in turn, greatly limits their applicability for further research and commercial use. The main reason for this sub-optimal performance on Semitic languages is that the pipeline design inherent in these frameworks is inappropriate for languages that exhibit extreme morphological ambiguity in their input stream: errors made early on in morphological segmentation and disambiguation jeopardize the accuracy of every later stage of the pipeline. For Hebrew, this performance gap has long been a show-stopper for advancing Language Technology and Artificial Intelligence for the Hebrew-speaking community. With this contribution, we aim to remedy this situation.

In this paper we describe the design and use of the Onlp system, a joint morphological-syntactic parsing framework for processing the Semitic language Modern Hebrew (henceforth, Hebrew). The system is accurate, efficient, and provides rich and expressive output including segmentation, POS tags, lemmas, features and labeled dependencies. The joint training and inference over the different layers substantially limits error propagation, and leads in turn to speed and high accuracy. Among the technical advantages of the Onlp suite are its open license, an easy 3-step installation, and a single package with all elements included, so there is no need to train or maintain individual components separately. The Onlp suite already serves academic and commercial projects in diverse domains. Its accompanying online demo has further proved valuable for educational purposes, exposing CS/NLP and non-CS researchers and engineers to the intricacies of Semitic NLP.

2 The Linguistic Challenge

In morphologically-rich languages (MRLs), each input token may consist of multiple lexical and functional units (henceforth, morphemes), each of which serves a particular role in the overall syntactic or semantic representation. In Hebrew, for example, the token ‘\cjRLwk/smhm‘bdh‘ corresponds to five word tokens in English, each of which carries its distinct role: ‘\cjRLw‘ (and, CC), ‘\cjRLk/s‘ (when, REL), ‘\cjRLm:‘ (from, IN), ‘\cjRLh‘ (the, DT), ‘\cjRLm‘bdh‘ (lab, NN). [Footnote 1: We use the annotation conventions of Sima’an et al. (2001) that underlie the Hebrew SPMRL scheme, http://www.spmrl.org/spmrl2013-sharedtask.html.] This means that in order to process Hebrew texts, one first needs to segment the Hebrew tokens into their constituent morphemes. At the same time, Hebrew raw tokens are highly ambiguous. A token such as ‘\cjRLhqph‘ may be interpreted as ‘\cjRLhqph‘ (orbit, NN), ‘\cjRLh‘ + ‘\cjRLqph‘ (the+coffee, DT+NN), or ‘\cjRLhqp’+ ‘\cjRL/sl’ + ‘\cjRLhy’‘ (perimeter of her, NN+POSS+PRP), etc. This is further complicated by the lack of diacritics in standardized texts, meaning that most vowels are not present, and that, out of context, no reading is a priori more likely than the others. Only in context do the correct interpretation and segmentation become apparent.

These facts create an apparent loop in the design of NLP pipelines for Hebrew: syntactic parsing requires morphological disambiguation, but morphological disambiguation requires syntactic context. This apparent loop has called for the development of joint systems, rather than pipelines, for processing Semitic languages (Tsarfaty, 2006; Green and Manning, 2010). This joint hypothesis has proven useful for Hebrew and Arabic phrase-structure parsing (Goldberg and Tsarfaty, 2008; Green and Manning, 2010; Goldberg and Elhadad, 2011). The Onlp suite is a dependency-based parsing framework implementing this joint hypothesis over the entire morpho-syntactic search space, as depicted in Figure 1 (More et al., 2019).

3 The Architectural Design

The core of Onlp is YAP (Yet Another Parser), a morpho-syntactic parser for morphological and syntactic analysis of Hebrew Texts. YAP re-implements and extends the structure-prediction framework of Zhang and Clark (2011). We describe YAP in detail in More and Tsarfaty (2016); More et al. (2019). Here we only provide a bird’s eye view of the architecture.

In YAP we embrace the extreme morphological ambiguity in Hebrew. That is, we do not aim to resolve morphological ambiguity via pre-processing. The input to YAP is the complete Morphological Analysis (MA) of an input sentence $x$, termed here MA$(x)$. MA$(x)$ is a lattice structure consisting of all possible morphological analyses of the input sentence, as seen in the middle of Figure 1. Each arc is a tuple specifying the start index, end index, the form of the segment, its part-of-speech, lemma, features, and the index of the raw token the arc originated from. An arc in the lattice can serve as a node in a syntactic dependency tree. Each contiguous path in the lattice presents one valid morphological segmentation of the sentence, for which a dependency tree can be assigned, as in Figure 1. For each path in the lattice, there is an exponential number of dependency trees that are potentially applicable.
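
To make the lattice structure concrete, here is a minimal Python sketch that stores arcs as tuples and enumerates every contiguous path, each of which is one candidate segmentation. It is illustrative only: the `Arc` class, the `enumerate_paths` function and the toy lattice (an ASCII transliteration of the 'orbit' vs. 'the+coffee' ambiguity from Section 2) are assumptions of this sketch, not YAP's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    start: int   # start node of the morpheme
    end: int     # end node of the morpheme
    form: str    # surface form of the segment
    pos: str     # part-of-speech tag
    token: int   # index of the raw token the arc originated from


def enumerate_paths(arcs, source, sink):
    """Return every contiguous arc sequence from source to sink.

    Assumes a well-formed lattice in which every arc moves forward
    (end > start), so the recursion always terminates.
    """
    by_start = {}
    for arc in arcs:
        by_start.setdefault(arc.start, []).append(arc)

    paths = []

    def extend(node, prefix):
        if node == sink:
            paths.append(list(prefix))
            return
        for arc in by_start.get(node, []):
            prefix.append(arc)
            extend(arc.end, prefix)
            prefix.pop()

    extend(source, [])
    return paths


# Toy lattice for a single ambiguous token (transliterated):
# "hqph" read as one noun, or as the definite marker "h" plus the noun "qph".
TOY_LATTICE = [
    Arc(0, 2, "hqph", "NN", 1),
    Arc(0, 1, "h", "DEF", 1),
    Arc(1, 2, "qph", "NN", 1),
]

if __name__ == "__main__":
    for path in enumerate_paths(TOY_LATTICE, 0, 2):
        print(" + ".join(f"{a.form}/{a.pos}" for a in path))
```

Exhaustive enumeration is only feasible for toy inputs; the number of paths grows quickly with sentence length, which is why a search procedure is needed in practice.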

We refer to the task of selecting the most likely lattice path as Morphological Disambiguation (MD), and to the task of selecting the most likely dependency tree for a given path as Dependency Parsing (DEP). For an input sentence $x$, our goal is to jointly predict a single pair of MD$(x)$ and DEP$(x)$ that are consistent with one another and form the most likely analysis of the sentence.

The MD component is the transition-based morphological parser of More and Tsarfaty (2016), which is formally based on the structure-prediction framework of Zhang and Clark (2011). MD accepts a sentence lattice MA$(x)$ as input and delivers a selected sequence of arcs (morphemes) MD$(x)$ as output. The transition-based system for MD selects arcs for MD one at a time. It decodes the lattice using beam search, and keeps the $K$-best paths at each step, scored according to morpheme-level and token-level features, weighted via structured-perceptron learning.
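
The idea of selecting arcs one at a time under a beam can be sketched as follows. This is a deliberately simplified Python illustration and not YAP's algorithm: the `score_arc` callback with hand-set weights stands in for the perceptron-weighted morpheme-level and token-level features, and hypotheses of different lengths are compared naively. All names here are hypothetical.

```python
def beam_search_md(arcs, source, sink, score_arc, beam_size=2):
    """Pick a high-scoring source-to-sink lattice path with a bounded beam.

    arcs: iterable of (start, end, form, pos) tuples with end > start.
    score_arc(path_so_far, arc): score contribution of adding `arc`.
    """
    by_start = {}
    for arc in arcs:
        by_start.setdefault(arc[0], []).append(arc)

    beam = [(0.0, source, [])]  # (score, current node, arcs chosen so far)
    finished = []
    while beam:
        # Expand every hypothesis in the beam by one arc.
        candidates = []
        for score, node, path in beam:
            for arc in by_start.get(node, []):
                candidates.append((score + score_arc(path, arc), arc[1], path + [arc]))
        # Keep only the best `beam_size` expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if cand[1] == sink:
                finished.append(cand)
            else:
                beam.append(cand)
    if not finished:
        return None
    return max(finished, key=lambda c: c[0])[2]


# Toy weights per POS tag; a real model would learn feature weights.
TOY_WEIGHTS = {"DEF": 1.0, "NN": 0.5}


def toy_score(path, arc):
    return TOY_WEIGHTS.get(arc[3], 0.0)


TOY_ARCS = [(0, 2, "hqph", "NN"), (0, 1, "h", "DEF"), (1, 2, "qph", "NN")]

if __name__ == "__main__":
    best = beam_search_md(TOY_ARCS, 0, 2, toy_score)
    print([arc[2] for arc in best])
```

Because scores are summed per arc, longer paths accumulate more score in this toy version; a real system has to define its features and step alignment so that hypotheses with different numbers of morphemes remain comparable.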

The DEP component is a re-implementation of the Zhang and Nivre (2011) dependency parser for English, adapted for Hebrew. We assume an Arc-Eager transition system and beam-search decoding. Feature weights are learned via the structured perceptron. We employ a carefully designed feature set that reflects linguistic properties of Hebrew such as its rich morphological paradigms, flexible word order, agreement, etc. This provides state-of-the-art results on Hebrew dependency parsing, albeit in an oracle (i.e., gold morphology) scenario.
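
For readers unfamiliar with the Arc-Eager system, the sketch below applies a given transition sequence using the four standard transitions (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE). It is a generic textbook illustration in Python, not YAP's implementation; it omits scoring and beam search entirely. The example sentence is a gloss of the `neg` example in Table 4, and the `subj` attachment in it is assumed for illustration.

```python
def arc_eager(n_tokens, transitions):
    """Apply arc-eager transitions; return {dependent: (head, label)}.

    Tokens are numbered 1..n_tokens; 0 is the artificial root.
    transitions: list of (action, label) pairs; label is ignored
    for SHIFT and REDUCE.
    """
    stack = [0]
    buffer = list(range(1, n_tokens + 1))
    heads = {}
    for action, label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # Front of buffer becomes head of stack top; pop the stack.
            dep = stack[-1]
            if dep == 0 or dep in heads:
                raise ValueError("LEFT-ARC precondition violated")
            heads[dep] = (buffer[0], label)
            stack.pop()
        elif action == "RIGHT-ARC":
            # Stack top becomes head of front of buffer; push it.
            dep = buffer.pop(0)
            heads[dep] = (stack[-1], label)
            stack.append(dep)
        elif action == "REDUCE":
            if stack[-1] not in heads:
                raise ValueError("REDUCE precondition violated")
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


# Gloss of "she not will-be-angry": 1=she, 2=not, 3=will-be-angry.
EXAMPLE = [
    ("SHIFT", None),
    ("SHIFT", None),
    ("LEFT-ARC", "neg"),    # neg(will-be-angry, not)
    ("LEFT-ARC", "subj"),   # subj(will-be-angry, she), assumed
    ("RIGHT-ARC", "ROOT"),  # ROOT(root, will-be-angry)
]

if __name__ == "__main__":
    print(arc_eager(3, EXAMPLE))
```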

Since both MD and DEP realize the same formal framework and inherit from the same computational machinery, we can easily unify them and treat the morpho-syntactic task as a single objective. The transition systems are combined and the beam-search decoder interleaves morphological and syntactic decisions. [Footnote 2: For a complete formal exposition of the algorithm we refer the reader to More et al. (2019).] Now morphological decisions may be affected by syntactic content, and vice versa.

The architecture is depicted in Figure 2. In More et al. (2019) we compared the performance of the joint system to our own pipeline system and to other systems available for Hebrew morphological and syntactic parsing, and showed significant improvements of YAP’s joint model over all competing systems.

4 The Annotation Scheme

We deliver automatic morpho-syntactic annotation of Hebrew texts based on the scheme of the SPMRL Hebrew dependency treebank. [Footnote 3: The detailed annotation scheme is provided, with examples, in the supplementary material along with the screencast.] The SPMRL Hebrew scheme employs the labels of Sima’an et al. (2001) for morphology and POS tags, and the Unified-SD scheme of Tsarfaty (2013) for the labeled dependencies. [Footnote 4: With an eye to future comparability, we further developed a conversion algorithm to convert the dependency trees from Unified-SD to Universal Dependencies (UD), https://universaldependencies.org/.] Specifically, we deliver the following annotation layers:

Morphological Segmentation

The most basic form of analysis of Hebrew texts is the segmentation of raw tokens into multiple meaning-bearing units that we call morphemes. [Footnote 5: In UD they are called words. In Hebrew NLP they are called segments. We use morphemes or segments herein.]

Due to orthographic and phonological processes, some morphemes do not appear explicitly in the surface form. Our segmentation recovers all morphemes, both overt and covert. For example, the token ‘\cjRLbbyt’ (in the house) is segmented as ‘\cjRLb’ + ‘\cjRLh’ + ‘\cjRLbyt’.

Part-of-Speech (POS) Tags

Each morphological segment is assigned a single Part-of-Speech tag category that indicates its syntactic role. The set of tags used by the system is based on the SPMRL scheme which in turn adopts the POS labels from Sima’an et al. (2001) (detailed in our appendix).

Morphological Features

Along with the POS category, we specify for each segment the properties that are signalled by inflectional morphology. The scheme encodes the following properties: Number [S (Singular) / P (Plural) / D (Dual)], Gender [F (Female) / M (Male) / F,M (both)], Person [1 / 2 / 3 / A (All)] [Footnote 6: A is used in cases where all analyses are valid, such as in the Beinoni form ‘\cjRL’wklt’ (I/you/she eat.singular.feminine).] and Tense [Past, Present, Future, Imperative, Infinitive]. [Footnote 7: Present-tense verbs and participles are tagged ‘Beinoni’.]
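
Downstream code usually wants these features as key-value pairs. The Python sketch below parses a feature string in the comma-separated `key=value` form shown in Table 2 (e.g. `gen=M,num=S,per=3,tense=PAST`), with an underscore meaning that no features are available. The function name `parse_feats` is hypothetical, and the sketch does not attempt to handle the "both genders" case, whose encoding is not shown in the tables here.

```python
def parse_feats(feats: str) -> dict:
    """Parse a FEATS string such as 'gen=M,num=S,per=3,tense=PAST'.

    Assumes the comma-separated key=value layout shown in Table 2;
    '_' (or an empty string) means no features.
    """
    if feats in ("", "_"):
        return {}
    result = {}
    for item in feats.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed feature: {item!r}")
        result[key] = value
    return result


if __name__ == "__main__":
    print(parse_feats("gen=M,num=S,per=3,tense=PAST"))
    print(parse_feats("_"))
```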

Lemmas

Each segment is also assigned a lemma, i.e., the canonical representation of its core (uninflected) meaning. [Footnote 8: Note that due to high morphological fusion in Hebrew, simple surface-based stemming will not suffice.] For Hebrew nouns and adjectives, the lemma is chosen to be the masculine singular form. For verbs, the lemma is the masculine singular third-person form in the past tense.

Dependency Tree

The dependency tree is defined over all morphological segments and an artificial root node. It consists of a set of labeled binary relations that indicate the bi-lexical dependencies between segments.

Note that the SPMRL dependency scheme, as opposed to UD, always selects functional heads, rather than lexical heads. The dependency labeling is based on the scheme from Tsarfaty (2013), repeated in the appendix.
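
Since the output is defined as a tree over the segments plus an artificial root, a consumer of the parses can sanity-check the HEAD pointers before using them. The following Python sketch checks that every head index is in range and that every segment reaches the root without cycles. It is an illustrative helper with a hypothetical name (`is_valid_tree`), and it does not enforce that exactly one segment attaches to the root, which is an extra constraint some schemes add.

```python
def is_valid_tree(heads):
    """Check that HEAD pointers form a tree rooted at the artificial node 0.

    heads[i - 1] is the head index of segment i (segments are 1-based);
    a head of 0 means the segment attaches to the artificial root.
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    for i in range(1, n + 1):
        seen = set()
        node = i
        # Follow head pointers upward; revisiting a node means a cycle.
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


if __name__ == "__main__":
    print(is_valid_tree([3, 3, 0]))  # segments 1 and 2 attach to 3; 3 to root
    print(is_valid_tree([2, 1, 0]))  # segments 1 and 2 point at each other
```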

Lattices

As explained in Section 3 above, a word can be segmented into morphemes in multiple ways, which are constrained by a broad-coverage lexicon. In addition to the parsed output, we make available for each input sentence its sentence lattice, i.e., the set of all possible segmentations for a given sentence, along with all possible morphosyntactic analyses for each arc.

5 Technical Details and Forms of Use

YAP is implemented in the Go language. [Footnote 9: https://golang.org/] It requires 6GB of RAM to run, and employs a simple 3-step installation, given in the supplementary material in the appendix. The input to the system is a tokenized sentence, with tokens appearing one per line, and a line break after every sentence. [Footnote 10: We assume the tokenization convention of MILA (Itai and Wintner, 2008).] The output is a dependency tree (where each node in the tree is a lattice arc) provided in the CoNLL-X format (Buchholz and Marsi, 2006). YAP is trained on the Hebrew section of the SPMRL shared task. It also makes use of the broad-coverage lexicon of Itai and Wintner (2008) for finding all potential lattice paths. For out-of-vocabulary (OOV) items, we employ a simple heuristic in which we suggest the 10 most likely analyses of rare tokens observed during training.

Simple Use | Command Line

From the command line, one can process one input file at a time, containing one or more sentences. The input file must be formatted with a single token per line, and an empty line denoting the end of every sentence.
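
A small helper can produce input in this layout from already-tokenized sentences. The Python sketch below writes one token per line and an empty line after every sentence, including the last one (the supplementary material stresses that the final line must be empty). The function names are hypothetical; the file is written as UTF-8 with plain `\n` line endings and no BOM.

```python
def format_yap_input(sentences):
    """Render tokenized sentences as one token per line,
    with an empty line closing every sentence (including the last)."""
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # empty line marks the end of the sentence
    return "\n".join(lines) + "\n"


def write_yap_input(path, sentences):
    # newline="\n" prevents Python from emitting CRLF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(format_yap_input(sentences))


if __name__ == "__main__":
    # ASCII placeholders; real input would contain Hebrew tokens.
    print(repr(format_yap_input([["tok1", "tok2"], ["tok3"]])))
```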

Processing a file takes two steps. First, run Morphological Analysis (./yap hebma) to generate a sentence lattice containing all possible morphological breakdowns of each token. YAP saves the lattice to the file specified via the -out flag.

Second, run joint Morphological Disambiguation and Dependency Parsing (./yap joint) to jointly predict the best lattice path and the corresponding dependency tree. The input to this command is the output file generated in the previous step, and there are 3 output files: one containing word segments, one containing the disambiguated lattice path, and one containing the complete dependency tree in CoNLL-X format.
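
The two steps can be scripted. The Python sketch below only assembles and runs the two commands; it is hedged in one important respect. The text above confirms the `hebma` and `joint` subcommands and the `-out` flag, but the remaining flags (`-raw`, `-in`, `-os`, `-om`, `-oc`) and the output file extensions are assumptions recalled from the project README and should be checked against the help output of your YAP build.

```python
import subprocess


def build_yap_commands(raw_path, stem, yap="./yap"):
    """Return the (hebma, joint) command lines as argument lists.

    Only `hebma`, `joint` and `-out` are confirmed by the paper text;
    the other flags are ASSUMED and may differ between YAP versions.
    """
    lattice = f"{stem}.lattice"
    hebma = [yap, "hebma", "-raw", raw_path, "-out", lattice]
    joint = [
        yap, "joint",
        "-in", lattice,                 # lattice produced by step 1
        "-os", f"{stem}.segmentation",  # word segments (assumed flag)
        "-om", f"{stem}.mapping",       # disambiguated path (assumed flag)
        "-oc", f"{stem}.conll",         # dependency tree (assumed flag)
    ]
    return hebma, joint


def run_yap(raw_path, stem, yap="./yap"):
    # check=True raises CalledProcessError if either step fails.
    for command in build_yap_commands(raw_path, stem, yap):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_yap_commands("input.txt", "output"):
        print(" ".join(command))
```

Passing argument lists rather than a shell string avoids quoting problems with file names.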

Advanced Use | RESTful API

YAP can run as a RESTful server that accepts parse requests. To do this, simply start the server, which listens on localhost port 8000. Now you can call the joint endpoint with a JSON object containing the list of tokens to process in the HTTP data payload. The response is a JSON object containing the three output levels (MA, MD and Dep). You can use jq and sed (or any other JSON and line-processing tools) to format the (tab-separated) responses and reassemble the output. Check our appendix for an illustration.
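
As an alternative to jq and sed, a client can be written with the Python standard library. The sketch below builds the request and splits a tab-separated response field into rows. Several details are assumptions that are not stated in this paper and should be verified against the server you run: the endpoint path `/yap/heb/joint`, the `"text"` payload key with space-joined tokens, and the use of a GET request with a body. Response key names are left to the caller for the same reason.

```python
import json
import urllib.request


def build_joint_request(tokens, base_url="http://localhost:8000",
                        path="/yap/heb/joint", method="GET"):
    """Build (but do not send) a request to the joint endpoint.

    ASSUMED, not confirmed by the paper: the endpoint path, the
    "text" key, and sending the JSON payload with a GET request.
    """
    payload = json.dumps({"text": " ".join(tokens)}).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=payload,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def rows_from_tsv(block):
    """Split a tab-separated text block into rows of fields,
    skipping empty lines."""
    return [line.split("\t") for line in block.splitlines() if line.strip()]


def parse_joint(tokens, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_joint_request(tokens, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would then pass the relevant field of the response to `rows_from_tsv` to recover the columns described in Table 5.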

Educational Use | The Online Demo

In 2018 we decided to create an online demo of the system for educational purposes: (i) to expose NLP/AI researchers to the NLP capabilities available for Hebrew; (ii) to educate non-CS scientists and engineers who work with Hebrew data (e.g., in the digital humanities) on text annotations that can potentially be useful for their applications; (iii) to launch outreach activities in which we teach what NLP is to the local community (e.g., school kids). [Footnote 11: E.g., https://www.youtube.com/watch?v=TFwQeoKpznA&feature=youtu.be]

To use the demo, simply go to onlp.openu.ac.il and type a Hebrew sentence in the textbox. The demo is built with the Django and Bootstrap web frameworks. It sends the user’s Hebrew text input to the Onlp server, which returns a CoNLL-X formatted parse along with the complete sentence lattice. Pre-processing includes pre-morphological tokenization of the input, in which punctuation is separated from the tokens. Double quotation marks are separated from the word unless they appear before the last character of the word, to avoid over-segmentation of acronyms. [Footnote 12: Acronyms in Hebrew are written with a quotation mark before the last letter, e.g. ‘\cjRLb”\cjRL’rh’ (USA).] The tokenized sequence is then passed to the Onlp server. The CoNLL-X output is then processed into the following layers: the FORM column is concatenated and presented as “Segmented Text”, and the POS, LEMMA, FEATS and DEPS are presented in separate accordion tabs.
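
The quotation-mark rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not the demo's actual code: the punctuation set and the function names are assumptions, and the examples use ASCII letters in place of Hebrew ones to keep the listing readable.

```python
PUNCT = set('.,;:!?()"')  # assumed punctuation set for this sketch


def split_inner_quotes(word):
    """Separate double quotes inside a word, except a quote that sits
    immediately before the last character (the acronym pattern)."""
    parts, current = [], ""
    for i, ch in enumerate(word):
        if ch == '"' and i != len(word) - 2:
            if current:
                parts.append(current)
            parts.append('"')
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def pre_tokenize(text):
    """Whitespace-split the text and detach punctuation from word edges."""
    tokens = []
    for chunk in text.split():
        lead, trail = [], []
        while chunk and chunk[0] in PUNCT:
            lead.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in PUNCT:
            trail.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(lead)
        if chunk:
            tokens.extend(split_inner_quotes(chunk))
        tokens.extend(trail)
    return tokens


if __name__ == "__main__":
    print(pre_tokenize('he said "hi".'))
    print(pre_tokenize('AB"C, then'))  # acronym-like token stays whole
```

In the second example the quote sits before the last character, so the token is kept intact, mirroring how a Hebrew acronym would be protected from over-segmentation.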

Furthermore, the demo presents the sentence lattice which is the input to the joint parser. This is useful for debugging, and for analyzing lexical-coverage in out-of-domain scenarios.

Expert Use | Out-of-Domain Scenarios

A bottleneck for the system in out-of-domain parsing scenarios is the coverage of the lexicon. We rely on a general-purpose lexicon containing over 500K entries. OOV words are treated via heuristics we designed, which are suitable for the general case only. However, accurately identifying vocabulary items may be critical when applying the parser to new domains with domain-specific information (medical, financial, political, etc.). Fortunately, we can extend the system with a domain-specific lexicon, thus extending the MA coverage. Due to joint inference, the availability of a better-suited lexical analysis triggers better lexico-syntactic decisions on the whole. [Footnote 13: We discuss how exactly this is executed in the appendix.]

6 Related and Future Work

Hebrew NLP in general and Hebrew parsing in particular are known to be challenging, due to interesting linguistic properties, the scarcity of annotated data, and the small research community around it. As a result, Hebrew has been seriously under-studied in NLP. During the early 2000s, the MILA knowledge center was established, where two of the main Hebrew resources for NLP were developed: the Hebrew treebank (Sima’an et al., 2001) and the Hebrew Lexicon (Itai and Wintner, 2008).

Morphological taggers for Hebrew using local linear context have been trained on these data and were made available for free use (Adler and Elhadad, 2006; Bar-haim et al., 2008). However, their performance was not on a par with parallel tools for English and was thus insufficient for commercial use. Hebrew dependency parsing was initially provided by Goldberg and Elhadad (2009), but the parser provides unlabeled dependencies, and the pipeline relied on Adler’s morphological tagger. This left the automatic dependency trees inaccurate and unsatisfying. Joint morpho-syntactic models for constituency-based parsing (Tsarfaty, 2010) showed good performance on benchmark data, but their code was never released for open use.

With the development of the UD treebanks collection, general frameworks such as UDPipe (Straka et al., 2016) and CoreNLP (Manning et al., 2014) have been trained on the Hebrew UD treebank, and the resulting models were made available. However, these models provide performance that is still far from satisfactory. As we also demonstrate in our screencast, [Footnote 14: https://www.youtube.com/watch?v=H6pvh1x20FQ] these systems make very basic mistakes, even with the simplest sentences. We conjecture that this is due to their inherent pipeline assumption: initial layers of processing introduce many mistakes due to the extreme morphological ambiguity, and later layers cannot recover. Notably, neural network models utilizing word embeddings (e.g., UDPipe) also still lag behind.

Table 1 shows the task coverage of existing tools and toolkits for NLP in Hebrew, from academic as well as private initiatives (NITE, Hebrew-NLP). The task coverage of the Onlp suite we present is on a par with international standards (UDPipe, CoreNLP), and its level of performance was shown to exceed all existing models (More et al., 2019). We are currently working towards Named-Entity Recognition as well as Open Information Extraction, to be added to Onlp in the near future.

7 Conclusion

This paper presents Onlp, a complete language-processing framework for automatic annotation of Modern Hebrew texts. The framework covers morphological segmentation, POS tags, lemmas and features, and dependency parsing, all predicted jointly. The system is easy to install and to use, and we support multiple forms of usage fitting user personas with different needs. We hope the availability of an open-source, accurate, and easy-to-use system for NLP in Hebrew will benefit the local NLP open-source community and greatly advance Hebrew language technology research and development, in academia and in industry.

Acknowledgements

We thank the NLPH community, in particular Shay Palachi, Amit Shkolnick and Yuval Feinstein, for much discussion and insightful comments. We further thank Avi Bivas (Innovation Authority) and Milo Avisar for promoting NLP initiatives in Israel. This research is supported by an ISF grant (1739/26) and an ERC Starting grant (677352), for which we are grateful.

Supplementary Material For EMNLP Demo Paper

These supplementary materials document the absolute essentials for starting to use the system: installation, annotation scheme documentation, forms of use, and enhancements for out-of-domains scenarios.

Appendix A Resources

YAP GitHub:

https://github.com/OnlpLab/yap

YAP Demo - Website:

http://onlp.openu.org.il

YAP Demo - Screencast: (Youtube)

https://www.youtube.com/watch?v=H6pvh1x20FQ

YAP Python-Wrapper:

https://github.com/amit-shkolnik/YAP-Wrapper

SPMRL-to-UD Conversion:

https://github.com/OnlpLab/Hebrew_UD

ONLP Lab Website:

http://onlp.openu.org.il/home

Appendix B Screen-Cast

Check out our screen-cast online demo at: https://www.youtube.com/watch?v=H6pvh1x20FQ

Appendix C Morphological Ambiguity: Lattices

Table 2 shows a sentence lattice capturing the high ambiguity of Hebrew morphological analysis. For a simple three-token input sentence, 22 possible arcs present valid analyses of the various tokens. A single contiguous path through the lattice needs to be selected for the sentence to be further processed by syntactic parsers or downstream applications.
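To make the notion of a path through a lattice concrete, the following is a minimal Python sketch that enumerates all contiguous paths in a toy lattice. The arcs, forms and tags are invented placeholders and are not taken from Table 2 or from YAP output; the function name `enumerate_paths` is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lattice: each arc is (start_node, end_node, form, pos_tag).
# Forms and tags are illustrative placeholders, not actual YAP output.
ARCS = [
    (0, 1, "b", "PREPOSITION"),
    (1, 2, "cl", "NN"),
    (0, 2, "bcl", "NN"),
    (2, 3, "lbn", "JJ"),
    (2, 3, "lbn", "NNP"),
]


def enumerate_paths(arcs, start, end):
    """Return every sequence of contiguous arcs leading from start to end."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)

    paths = []

    def walk(node, prefix):
        if node == end:
            paths.append(list(prefix))
            return
        for arc in outgoing[node]:
            prefix.append(arc)
            walk(arc[1], prefix)
            prefix.pop()

    walk(start, [])
    return paths


paths = enumerate_paths(ARCS, 0, 3)
for path in paths:
    # Each printed line is one candidate segmentation with its tags.
    print(" ".join(f"{form}/{tag}" for _, _, form, tag in path))
```

Even in this small example, five arcs give rise to four distinct paths; a disambiguation component has to commit to exactly one of them.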

Appendix D Annotation Layers

The annotation scheme provided by Onlp corresponds to that of the Hebrew section of the SPMRL shared task 2013-2014 (http://www.spmrl.org/spmrl2013-sharedtask.html). The part-of-speech tags we employ are provided, along with illustrative examples, in Table 3. The dependency labels are defined and illustrated in Table 5.

Appendix E The Online Demo

In Figure 3 we present a screen capture of the Morphological Segmentation, POS tags and Dependency Relations for two raw input sentences, as executed on our demo page:

• ‘הבן שכב בצל’ (‘the-boy was-lying in-the-shade’)

• ‘הבן שנם בצל’ (‘the-boy that-was-napping in-the-shade’)

Note that the two raw sentences have very similar forms (in fact, they differ in only two characters), but they end up forming very different syntactic structures, which the Onlp system annotates correctly.

Appendix F Forms of Use

Figures 4–6 present the usage patterns with the YAP parser, the core algorithm of the framework. In Figure 4 we present the 3-step installation, in Figure 5 we show a simple command-line use, and in Figure 6 we show how to use YAP as a service. As noted before, the input file must be formatted with a single token per line and an empty line denoting the end of a sentence. Crucially, the last line in the file must be empty to denote the end of the last sentence. A note for Windows users: YAP doesn’t handle Windows-style text files that have BOM marks and CRLF newlines, so if you’re running on Windows and YAP doesn’t work, make sure your input has neither CRLF line endings nor BOM marks. YAP has been written in Go in order to enable multi-threading, which means that it can be called from multiple threads in parallel. As of June 2019 there is also a Python wrapper, created by members of the Israeli open-source community. (Credit goes to Amit Shkolnik of the 4girls initiative; further details can be found at https://github.com/amit-shkolnik/YAP-Wrapper.)
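The input-format requirements above (one token per line, an empty line after every sentence including the last, no BOM, no CRLF) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `to_yap_input` and `clean_bytes` are hypothetical and are not part of YAP or of the Python wrapper, and the tokens are Latin placeholders.

```python
import codecs


def to_yap_input(sentences):
    """Format pre-tokenized sentences: one token per line, empty line after each sentence."""
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append("")  # empty line marks the end of the sentence
    if not lines:
        return ""
    # Joining with "\n" and appending a final "\n" leaves an empty last line.
    return "\n".join(lines) + "\n"


def clean_bytes(raw):
    """Strip a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.replace(b"\r\n", b"\n")


# Placeholder tokens standing in for Hebrew input.
formatted = to_yap_input([["tok1", "tok2"], ["tok3"]])
print(repr(formatted))
```

When writing the result to disk from Python, opening the file with `open(path, "w", encoding="utf-8", newline="\n")` avoids both the BOM and the CRLF translation that Windows would otherwise apply.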

Appendix G Out-of-Domain Scenarios

When observing errors in a new domain, the first thing to check is whether or not these are due to lexical gaps, i.e., whether they stem from a lack of coverage in the lexicon. The availability of the sentence lattice output is of great value in this respect. By reviewing the lattice, it is possible to see whether the lexicon contains the correct morphological analysis for the input token at all. If the correct analysis is not in the lattice, it is easy to add the missing analyses by editing the lexicon (the lexicon file is located at data/bgulex/bgulex.utf8.hr).

Each line in the lexicon file contains a token followed by a list of one or more possible morphological analyses of that token. An analysis is a tuple made of 3 parts $\langle$ prefix:host:suffix $\rangle$ followed by the host lemma. Each tuple member contains the part-of-speech tag and morphological features of that element. The prefix and suffix may be empty. For example:

$>$ אאבד :VB-MF-S-1-FUTURE-NIFAL: נאבד :VB-MF-S-1-FUTURE-PIEL: איבד

An example use case could arise when processing medical-domain texts related to cancer, in which the word ‘לימפה’ (lymph) appears in the text but is missing from the lexicon. In this case, the parser errs by identifying the initial ‘ל’ as the preposition “to”, followed by a proper noun.

To remedy this, we can update the lexicon by adding the following line:

$>$ לימפה :NN-F-S: לימפה

This means that the token לימפה is a common noun with feminine gender and singular number, followed by the lemma, and that it is unambiguous (i.e., only one analysis is available). Note that after updating the lexicon you need to restart YAP (if it is running as a RESTful server) for the lexical changes to apply.

Now that לימפה is no longer an OOV, sentences containing this token will be given a more accurate lattice and, as a result, will be analyzed with a global syntactic structure that accords with the correct analysis. We suggested these lexicon edits to our users working in specific domains in the industry (medical, social, political), and they attested to significant improvements when running on these particular domains (Yuval Feinstain, NLP Consultant, p.c.).
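The lexicon line format described in this appendix can be illustrated with a small Python parser. This is a sketch under assumptions that go beyond the text: it assumes fields are whitespace-separated, that each analysis is a single `prefix:host:suffix` field immediately followed by its lemma, and that lemmas contain no whitespace. The names `Analysis` and `parse_lexicon_line` are hypothetical, and the token is a Latin transliteration standing in for the Hebrew form.

```python
from typing import List, NamedTuple, Tuple


class Analysis(NamedTuple):
    prefix: str  # tags of the prefix, possibly empty
    host: str    # POS tag and features of the host
    suffix: str  # tags of the suffix, possibly empty
    lemma: str   # lemma of the host


def parse_lexicon_line(line: str) -> Tuple[str, List[Analysis]]:
    """Split a lexicon line into its token and its list of analyses.

    Assumes whitespace-separated fields: token, then (analysis, lemma) pairs.
    """
    fields = line.split()
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"expected a token followed by analysis/lemma pairs: {line!r}")
    token, rest = fields[0], fields[1:]
    analyses = []
    for tags, lemma in zip(rest[0::2], rest[1::2]):
        parts = tags.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected prefix:host:suffix, got {tags!r}")
        analyses.append(Analysis(parts[0], parts[1], parts[2], lemma))
    return token, analyses


# "lymph" is a transliterated placeholder for the Hebrew token.
token, analyses = parse_lexicon_line("lymph :NN-F-S: lymph")
print(token, analyses)
print("unambiguous:", len(analyses) == 1)
```

A check of this kind can be run over a candidate line before it is added to the lexicon file, so that a malformed entry is caught before YAP is restarted.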

Bibliography

1. Adler and Elhadad (2006): Meni Adler and Michael Elhadad. 2006. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In ACL. The Association for Computer Linguistics.
2. Bar-haim et al. (2008): Roy Bar-haim, Khalil Sima’an, and Yoad Winter. 2008. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering, 14(2):223–251.
3. Buchholz and Marsi (2006): Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.
4. Goldberg and Elhadad (2009): Yoav Goldberg and Michael Elhadad. 2009. Hebrew dependency parsing: Initial results. In Proceedings of the 11th International Conference on Parsing Technologies, IWPT ’09, pages 129–133.
5. Goldberg and Elhadad (2011): Yoav Goldberg and Michael Elhadad. 2011. Joint Hebrew segmentation and parsing using a PCFGLA lattice parser. In Proceedings of ACL.
6. Goldberg and Tsarfaty (2008): Yoav Goldberg and Reut Tsarfaty. 2008. A single framework for joint morphological segmentation and syntactic parsing. In Proceedings of ACL.
7. Green and Manning (2010): Spence Green and Christopher D. Manning. 2010. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of COLING.
8. Honnibal and Montani (2017): Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
