Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Gianluca Vico; Jindřich Libovický

arXiv:2602.14675 · cs.CL · February 17, 2026

TL;DR

This paper introduces a crowdsourced Piedmontese dataset with natural orthography to evaluate large language models on tokenization, classification, and translation tasks, revealing challenges and capabilities in handling low-resource, non-standard language data.

Contribution

The paper provides the first crowdsourced Piedmontese dataset with natural orthography and benchmarks LLM performance on multiple NLP tasks for this endangered language.

Findings

01 Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages.

02 LLM classification performance on Piedmontese approaches that on Italian, French, and English.

03 Machine translation is asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging.

Abstract

We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are…
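The abstract mentions tokenization parity without defining it. The sketch below assumes one common reading: the ratio of total token counts between the two sides of a parallel corpus, where a value above 1 means the first side needs more tokens for the same content. The paper's exact metric may differ. The names `toy_tokenize` and `tokenization_parity` are hypothetical, the tokenizer is a stand-in for a real LLM tokenizer, and the input strings are placeholders, not data from the dataset.

```python
from typing import Callable, List, Sequence


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer (not a real LLM tokenizer).

    Splits on whitespace, then cuts each word into pieces of at most
    `piece_len` characters.
    """
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


def tokenization_parity(
    side_a: Sequence[str],
    side_b: Sequence[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Ratio of total tokens on side A to total tokens on side B.

    Assumes side_a[i] and side_b[i] are translations of each other.
    """
    if len(side_a) != len(side_b):
        raise ValueError("parallel corpora must have the same number of sentences")
    tokens_a = sum(len(tokenize(s)) for s in side_a)
    tokens_b = sum(len(tokenize(s)) for s in side_b)
    if tokens_b == 0:
        raise ValueError("side B produced no tokens")
    return tokens_a / tokens_b


if __name__ == "__main__":
    # Placeholder strings only; substitute real parallel sentences.
    side_a = ["abcdefgh ij"]
    side_b = ["abcd"]
    print(tokenization_parity(side_a, side_b, toy_tokenize))
```

To measure a real model, one would pass that model's own tokenizer in place of `toy_tokenize`, since the penalty depends on the vocabulary the model was trained with.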

Code & Models

Datasets

ufal/CrowdsourcingPiedmontese
dataset · 5 downloads

Videos

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Authorship Attribution and Profiling