RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Dominic P. Fischer; Zachary Hopton; Jannis Vamvas

arXiv:2604.11233·cs.CL·April 14, 2026


TL;DR

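RUMLEM is a dictionary-based lemmatizer for Romansh, built on comprehensive, community-driven morphological databases. It covers the five main varieties plus the standard variety Rumantsch Grischun, and it can also be used to identify which variety a text is written in.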

Contribution

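The paper introduces a lemmatizer built on morphological databases for Romansh, with a dedicated database for each of the five main varieties and for the supra-regional standard Rumantsch Grischun. Because the databases are variety-specific, the lemmatizer also serves as a variety-aware language classifier.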

Findings

01

Covers 77-84% of the words in a typical Romansh text.

02

Correctly identifies the Romansh variety in 95% of cases, evaluated on 30'000 Romansh texts of varying lengths.

03

A proof of concept demonstrates the feasibility of lemmatizer-based Romansh vs. non-Romansh language classification.

Abstract

Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
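
The idea in the abstract can be illustrated with a minimal sketch: look each word form up in a per-variety dictionary, measure how many tokens are covered, and pick the variety whose dictionary covers the most tokens. This is not RUMLEM's actual implementation or API. The database layout, the function names, the variety labels, and every word form below are made-up placeholders (they are not real Romansh), and scoring by coverage is an assumption about how dictionary lookups could drive classification.

```python
# Hypothetical toy databases, one per variety: inflected form -> possible lemmas.
# All strings are invented placeholders, not entries from the real databases.
TOY_DB = {
    "variety_a": {"tokas": {"toka"}, "toka": {"toka"}, "miru": {"mirar"}},
    "variety_b": {"tokes": {"toke"}, "toke": {"toke"}, "miru": {"mirer"}},
}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    tokens = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [tok for tok in tokens if tok]


def lemmatize(tokens, variety):
    """Return (token, sorted candidate lemmas); unknown tokens get an empty list."""
    db = TOY_DB[variety]
    return [(tok, sorted(db.get(tok, ()))) for tok in tokens]


def coverage(tokens, variety):
    """Fraction of tokens that have an entry in the variety's database."""
    if not tokens:
        return 0.0
    db = TOY_DB[variety]
    return sum(tok in db for tok in tokens) / len(tokens)


def guess_variety(text):
    """Pick the variety whose database covers the largest share of tokens."""
    tokens = tokenize(text)
    scores = {variety: coverage(tokens, variety) for variety in TOY_DB}
    best = max(scores, key=scores.get)
    return best, scores


def looks_in_language(text, threshold=0.5):
    """Assumed heuristic: in-language if the best coverage reaches the threshold."""
    _, scores = guess_variety(text)
    return max(scores.values()) >= threshold
```

With these toy dictionaries, `guess_variety("Tokas miru xyz.")` prefers `variety_a`, because two of the three tokens are found there and only one in `variety_b`. The `threshold` in `looks_in_language` is an arbitrary illustrative value, not one reported in the paper.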
