Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation
Zoey Liu, Emily Prud'hommeaux

TL;DR
This study investigates how different morphological segmentation models generalize across languages and data variations in low-resource settings, emphasizing the importance of data set characteristics over size for reliable evaluation.
Contribution
The paper analyzes how well morphological segmentation models generalize across 11 languages from 6 language families and across varied data conditions, and identifies the data set characteristics that affect performance.
Findings
Model generalization varies with data set characteristics.
Data set size is less influential than morpheme overlap.
Random sampling of data sets improves evaluation reliability.
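The two findings about morpheme overlap and random sampling can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: the function names morpheme_overlap and random_splits, the overlap definition (share of distinct test morphemes that also appear in training), and the toy English data are all assumptions made for this example. The paper's own definition of overlap may differ.

```python
import random


def morpheme_overlap(train, test):
    """Fraction of distinct test morphemes that also occur in the training set.

    Assumed definition for illustration; each item is a list of morphemes.
    """
    train_morphs = {m for word in train for m in word}
    test_morphs = {m for word in test for m in word}
    if not test_morphs:
        return 0.0
    return len(test_morphs & train_morphs) / len(test_morphs)


def random_splits(data, n_splits, test_fraction, seed=0):
    """Yield (train, test) pairs from repeated random shuffles of the data."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: words already segmented into morphemes.
data = [
    ["walk", "ed"], ["walk", "ing"], ["talk", "ed"],
    ["talk", "s"], ["jump", "ing"], ["jump", "s"],
]

# Overlap varies from split to split, which is why a single split
# can give an unrepresentative picture in a small data set.
overlaps = [morpheme_overlap(tr, te) for tr, te in random_splits(data, 5, 0.33)]
print(overlaps)
```

Reporting the spread of a score over several random splits, rather than one number from one split, is the general idea behind the sampling-based evaluation described above.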
Abstract
Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
