RUMLEM: A Dictionary-Based Lemmatizer for Romansh
Dominic P. Fischer, Zachary Hopton, Jannis Vamvas

TL;DR
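RUMLEM is a dictionary-based lemmatizer for Romansh built on comprehensive, community-driven morphological databases. It covers the five main varieties plus Rumantsch Grischun and, because each variety has its own database, also supports variety identification and Romansh vs. non-Romansh classification.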
Contribution
The paper introduces a lemmatizer based on morphological databases for Romansh, covering the five main varieties and the supra-regional standard Rumantsch Grischun, and applies it to variety-aware language classification.
Findings
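Covers 77-84% of the words in a typical Romansh text.
Correctly identifies the Romansh variety in 95% of cases, evaluated on 30'000 Romansh texts of varying lengths.
Demonstrates, as a proof of concept, that Romansh vs. non-Romansh classification based on the lemmatizer is feasible.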
Abstract
Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
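The idea described in the abstract, dictionary lookup per variety with coverage used as a classification signal, can be illustrated with a minimal sketch. This is not RUMLEM's implementation or API: the function names, the `threshold` parameter, the "highest coverage wins" rule, and the toy databases (invented placeholder strings, not real Romansh data) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy databases: inflected form -> lemma, one dict per variety.
# All entries are invented placeholders, NOT real Romansh data.
DATABASES: Dict[str, Dict[str, str]] = {
    "variety_a": {"wordas": "worda", "runs_a": "run_a", "shared": "shared"},
    "variety_b": {"wordes": "worde", "runs_b": "run_b", "shared": "shared"},
}


def lemmatize(token: str, db: Dict[str, str]) -> Optional[str]:
    """Return the lemma for a token, or None if the form is not in the database."""
    return db.get(token.lower())


def coverage(tokens: List[str], db: Dict[str, str]) -> float:
    """Fraction of tokens whose form is found in the given database."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if lemmatize(t, db) is not None)
    return hits / len(tokens)


def classify_variety(
    tokens: List[str], databases: Dict[str, Dict[str, str]]
) -> Tuple[str, float]:
    """Pick the variety whose database covers the most tokens (assumed rule).

    Ties go to the variety listed first in `databases`.
    """
    scores = {name: coverage(tokens, db) for name, db in databases.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


def is_romansh(
    tokens: List[str],
    databases: Dict[str, Dict[str, str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> bool:
    """Treat a text as Romansh if the best per-variety coverage reaches the threshold."""
    _, best_score = classify_variety(tokens, databases)
    return best_score >= threshold


if __name__ == "__main__":
    text = ["wordas", "shared", "runs_a", "unknown"]
    print(classify_variety(text, DATABASES))
    print(is_romansh(["the", "cat"], DATABASES))
```

In this sketch a form shared between varieties counts for each of them, so only variety-specific forms separate the scores; short texts with few such forms would be harder to classify under this rule.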
