GLeMM: A large-scale multilingual dataset for morphological research
Nabil Hathout (CLLE, Comue de Toulouse), Basilio Calderone (CLLE, UBM), Fiammetta Namer (ATILF, UL), Franck Sajous (CLLE-ERSS, Comue de Toulouse)

TL;DR
GLeMM is a large, multilingual, automatically annotated morphological dataset derived from Wiktionary, designed to support data-driven research and computational modeling of derivational morphology across seven European languages.
Contribution
The paper introduces GLeMM, a large, automatically built morphological resource with semantic annotations, enabling new research on form-meaning relations in derivational morphology.
Findings
GLeMM covers seven European languages.
The dataset includes automatic morphological feature annotations.
Semantic descriptions are provided for a subset of entries.
Abstract
In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? Answers to this type of question are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently seven European languages: German, English, Spanish, French, Italian, Polish, and Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these…
