Robust Language Identification for Romansh Varieties

Charlotte Model; Sina Ahmadi; Jannis Vamvas

arXiv:2603.15969·cs.CL·May 6, 2026

TL;DR

This paper introduces an SVM-based language identification (LID) system for the Romansh idioms and Rumantsch Grischun. It reaches an average in-domain accuracy of 97%, which supports applications such as idiom-aware spell checking and machine translation.

Contribution

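The paper presents an LID system that distinguishes the Romansh idioms and Rumantsch Grischun, a task for which the authors report a lack of documented prior efforts. It is accompanied by a newly curated benchmark covering two domains and a publicly available classifier.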

Findings

01 Reaches an average in-domain accuracy of 97% on a newly curated benchmark spanning two domains.

02 Distinguishes the regional Romansh idioms from one another and from the supra-regional Rumantsch Grischun.

03 Supports downstream applications such as idiom-aware spell checking and machine translation.

Abstract

The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
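To make the "SVM approach" mentioned in the abstract concrete, the following is a minimal sketch of a one-vs-rest linear SVM over character n-grams, trained by subgradient descent on the hinge loss. It is not the authors' implementation: the abstract does not specify features, hyperparameters, or training procedure, so the character n-gram features, the function names (`char_ngrams`, `train_ovr_svm`, `predict`), the learning settings, and the labels are all assumptions made for illustration. The training strings are synthetic placeholders, not Romansh text.

```python
from collections import Counter

# Synthetic placeholder strings, NOT real Romansh text; labels are hypothetical.
TOY_TEXTS = [
    "zaza zaza", "zaze zaza",
    "koko koko", "koku koko",
    "mimi mimi", "mimy mimi",
]
TOY_LABELS = [
    "variety_a", "variety_a",
    "variety_b", "variety_b",
    "variety_c", "variety_c",
]


def char_ngrams(text, n_min=1, n_max=3):
    """Return L2-normalised character n-gram counts for one text."""
    padded = f" {text.lower()} "  # pad so word boundaries become features
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}


def score(weights, feats):
    """Dot product between a sparse weight vector and sparse features."""
    return sum(weights.get(g, 0.0) * v for g, v in feats.items())


def train_ovr_svm(texts, labels, epochs=50, lr=0.1, reg=0.01):
    """Train one linear SVM per class (one-vs-rest) with hinge loss."""
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}
    vectors = [char_ngrams(t) for t in texts]
    for _ in range(epochs):
        for feats, label in zip(vectors, labels):
            for c in classes:
                w = weights[c]
                target = 1.0 if label == c else -1.0
                violated = target * score(w, feats) < 1.0
                # L2 regularisation: shrink all existing weights slightly.
                for g in w:
                    w[g] *= 1.0 - lr * reg
                # Hinge-loss subgradient step only when the margin is violated.
                if violated:
                    for g, v in feats.items():
                        w[g] = w.get(g, 0.0) + lr * target * v
    return weights


def predict(weights, text):
    """Return the class whose SVM assigns the highest score."""
    feats = char_ngrams(text)
    return max(sorted(weights), key=lambda c: score(weights[c], feats))


if __name__ == "__main__":
    model = train_ovr_svm(TOY_TEXTS, TOY_LABELS)
    print(predict(model, "zaza"))
```

The toy classes use disjoint character sets, so they are trivially separable. Real idioms share most of their vocabulary, and Rumantsch Grischun combines elements of several idioms, so a practical system would need real training data and most likely a library SVM implementation rather than this hand-written loop.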
