The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

Khuyagbaatar Batsuren; Gábor Bella; Aryaman Arora; Viktor Martinović; Kyle Gorman; Zdeněk Žabokrtský; Amarsanaa Ganbold; Šárka Dohnalová; Magda Ševčíková; Kateřina Pelegrinová; Fausto Giunchiglia; Ryan Cotterell; Ekaterina Vylomova

arXiv:2206.07615·cs.CL·June 16, 2022

TL;DR

The SIGMORPHON 2022 shared task evaluated morpheme segmentation systems at the word level (9 languages) and at the sentence level (3 languages). The best word-level system averaged 97.29% F1, and the best sentence-level systems beat three subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. Resources are openly shared for future research.

Contribution

This paper introduces a comprehensive shared task on morpheme segmentation, providing new benchmarks, datasets, and evaluation results across diverse languages and morphological phenomena.

Findings

01

The best system averaged 97.29% F1 across 9 languages in word-level segmentation (Subtask 1).

02

In sentence-level segmentation (Subtask 2), the best systems outperformed three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute.

03

Resources and datasets are publicly released for future research.

Abstract

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams. The best system averaged 97.29% F1 across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams. The best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we…
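The abstract reports F1 scores but does not spell out how they are computed. As a rough illustration only, the sketch below scores a predicted segmentation against a gold one by treating each as a multiset of morphemes. The function names (`morpheme_f1`, `corpus_f1`), the multiset matching, the micro-averaging, and the example words are all assumptions made for this sketch; the official evaluation script in the task repository may differ.

```python
from collections import Counter


def morpheme_f1(gold, pred):
    """Return (precision, recall, F1) for one word.

    Assumption: morphemes are compared as multisets, so order is ignored
    and a repeated morpheme must be predicted as many times as it occurs.
    """
    overlap = sum((Counter(gold) & Counter(pred)).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def corpus_f1(pairs):
    """Micro-averaged F1 over (gold, pred) pairs (a hypothetical choice)."""
    overlap = sum(
        sum((Counter(gold) & Counter(pred)).values()) for gold, pred in pairs
    )
    n_pred = sum(len(pred) for _, pred in pairs)
    n_gold = sum(len(gold) for gold, _ in pairs)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data, not taken from the shared task datasets.
gold = ["un", "break", "able"]
pred = ["un", "breakable"]
precision, recall, f1 = morpheme_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here the under-segmented prediction matches only one of three gold morphemes, so recall drops further than precision. A subword tokenizer such as BPE could be scored the same way by passing its token sequence as `pred`.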

Code & Models

Repositories

sigmorphon/2022segmentationst (official)