The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Khuyagbaatar Batsuren, Gábor Bella, Aryaman Arora, Viktor Martinović, Kyle Gorman, Zdeněk Žabokrtský, Amarsanaa Ganbold, Šárka Dohnalová, Magda Ševčíková, Kateřina Pelegrinová, Fausto Giunchiglia, Ryan Cotterell, Ekaterina Vylomova

TL;DR
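The SIGMORPHON 2022 shared task evaluated morpheme segmentation systems at the word level (9 languages) and the sentence level (3 languages). The best word-level system averaged 97.29% F1, and the best sentence-level systems outperformed three subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. The resources are openly shared for future research.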
Contribution
This paper presents a shared task on morpheme segmentation with two subtasks, word-level (9 languages) and sentence-level (3 languages), and reports datasets and evaluation results covering compounds, derivations, and inflections.
Findings
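In word-level segmentation (Subtask 1), the best system averaged 97.29% F1 across 9 languages, ranging from 93.84% (English) to 99.38% (Latin).
In sentence-level segmentation (Subtask 2), the best systems outperformed three subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute.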
Resources and datasets are publicly released for future research.
Abstract
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams; the best system averaged 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we…
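As a rough illustration of how an F1 score over morphemes can be computed, the sketch below counts the overlap between predicted and gold morphemes for each word and micro-averages over the data. This is an assumption about the general shape of such a metric, not the task's official scorer, which may differ in details. The function name `morpheme_prf` and the example words are hypothetical and are not taken from the shared task data.

```python
from collections import Counter


def morpheme_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over morphemes.

    gold, pred: parallel lists of segmentations, each a list of
    morpheme strings. Assumed metric definition, not the official scorer.
    """
    true_pos = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        # Multiset intersection: a morpheme counts as correct at most
        # as many times as it occurs in the gold segmentation.
        overlap = Counter(g) & Counter(p)
        true_pos += sum(overlap.values())
        n_pred += len(p)
        n_gold += len(g)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: one under-segmented word, one correct word.
gold = [["un", "believ", "able"], ["dog", "s"]]
pred = [["un", "believable"], ["dog", "s"]]
p, r, f = morpheme_prf(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case 3 of the 4 predicted morphemes match the 5 gold morphemes, so under-segmentation costs recall more than precision.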
