Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park; Katherine J. Zhang; Coleman Haley; Kenneth Steimel; Han Liu; Lane Schwartz

arXiv:2012.06262·cs.CL·March 29, 2021

TL;DR

This study investigates how morphological complexity affects multilingual language modeling. It finds that several morphological measures are associated with higher surprisal in LSTM language models trained on BPE-segmented data, and that linguistically-motivated segmentation strategies can mitigate this effect.

Contribution

The paper analyzes morphological influences on language modeling using a larger corpus (145 Bible translations in 92 languages) and a larger number of typological features than prior studies, and it compares subword segmentation strategies (BPE, Morfessor, and FSTs).

Findings

01. Several measures of morphological complexity are significantly associated with higher surprisal when LSTM models are trained on BPE-segmented data.

02. Linguistically-motivated segmentation reduces the impact of morphology.

03. FST and Morfessor segmentation outperform BPE in handling morphology.

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the…
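
The abstract compares models trained on differently segmented data, so it may help to see how surprisal behaves when the number of segments changes. The sketch below is illustrative only and is not taken from the paper: the function names and the probabilities are hypothetical, and it assumes surprisal is measured in bits as the negative base-2 log of the probability a model assigns to each segment.

```python
import math


def total_surprisal_bits(segment_probs):
    """Sum -log2(p) over every segment the model predicted for one text."""
    return sum(-math.log2(p) for p in segment_probs)


def per_segment_surprisal_bits(segment_probs):
    """Average surprisal per segment; depends on how finely the text is split."""
    return total_surprisal_bits(segment_probs) / len(segment_probs)


# Hypothetical probabilities for the same word under two segmentations.
whole_word = [0.125]      # one coarse token
subwords = [0.5, 0.25]    # two finer subword tokens

print(total_surprisal_bits(whole_word))        # total over the coarse split
print(total_surprisal_bits(subwords))          # total over the fine split
print(per_segment_surprisal_bits(subwords))    # average shrinks with more segments
```

In this toy case both segmentations give the same total surprisal for the text, while the per-segment average differs. That is one reason a total over the same content is a more comparable quantity than a per-token average when segmentation strategies are being compared.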

Code & Models

Repositories

hayleypark/MorphologyMatters (official)

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory