GLeMM: A large-scale multilingual dataset for morphological research

Nabil Hathout (CLLE; Comue de Toulouse); Basilio Calderone (CLLE; UBM); Fiammetta Namer (ATILF; UL); Franck Sajous (CLLE-ERSS; Comue de Toulouse)

arXiv:2604.12442·cs.CL·April 15, 2026

TL;DR

GLeMM is a large, multilingual, automatically annotated morphological dataset from Wiktionary, designed to facilitate data-driven research and computational modeling of derivational morphology across seven European languages.

Contribution

The paper introduces GLeMM, a large, automatically built derivational resource with semantic annotations, intended to enable new research on form-meaning relations in derivational morphology.

Findings

01 GLeMM covers seven European languages: German, English, Spanish, French, Italian, Polish, and Russian.

02 The dataset includes automatic morphological feature annotations on each entry.

03 Semantic descriptions are encoded for a significant subset of entries.
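To make the three findings concrete, here is a minimal sketch of how a derivational entry with feature annotations and an optional semantic description could be represented and queried. Everything in it is an assumption for illustration: the `DerivationalEntry` class, its field names, the feature keys, and the three toy entries are hypothetical and are not taken from GLeMM, whose actual schema is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DerivationalEntry:
    """Hypothetical record layout, NOT the actual GLeMM schema."""
    language: str                      # language code, e.g. "fr"
    base: str                          # base lexeme
    derivative: str                    # derived lexeme
    features: Dict[str, str] = field(default_factory=dict)  # morphological features
    semantic_description: Optional[str] = None  # present only for a subset of entries


def share_with_semantics(entries: List[DerivationalEntry]) -> Dict[str, float]:
    """Return, per language, the fraction of entries that carry a semantic description."""
    totals: Dict[str, int] = {}
    described: Dict[str, int] = {}
    for entry in entries:
        totals[entry.language] = totals.get(entry.language, 0) + 1
        if entry.semantic_description is not None:
            described[entry.language] = described.get(entry.language, 0) + 1
    return {lang: described.get(lang, 0) / count for lang, count in totals.items()}


# Invented toy entries for illustration only; they are not drawn from the dataset.
entries = [
    DerivationalEntry("fr", "laver", "lavage",
                      {"base_pos": "V", "derivative_pos": "N"}, "action of washing"),
    DerivationalEntry("fr", "chanter", "chanteur",
                      {"base_pos": "V", "derivative_pos": "N"}),
    DerivationalEntry("en", "teach", "teacher",
                      {"base_pos": "V", "derivative_pos": "N"}, "one who teaches"),
]

coverage = share_with_semantics(entries)
print(coverage)
```

Keeping the semantic description optional mirrors the statement that only a subset of entries is semantically described, and a per-language breakdown is the kind of check one would likely run first on a resource whose design is identical across languages.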

Abstract

In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? Answers to this type of question are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these…
