Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum
Ryan Soh-Eun Shim, Barbara Plank

TL;DR
This paper investigates within-dialect variation in NLP, using Italian dialects as a case study. It shows that geographical and linguistic factors affect speech-to-text performance and proposes methods for predicting performance on unseen dialect varieties.
Contribution
The paper analyzes within-dialect variation in NLP, links performance disparities to geographical and linguistic factors, and proposes geostatistical methods for zero-shot performance prediction.
Findings
Performance varies significantly within dialect categories.
Geographical proximity correlates with speech-to-text accuracy.
Geostatistical models improve prediction of unseen dialect performance.
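To make the idea of predicting performance at an unseen location concrete, here is a minimal sketch of spatial interpolation. It uses inverse distance weighting (IDW), a simple baseline swapped in purely for illustration; it is not the geostatistical model the paper proposes. The function names, coordinates, and word error rates are hypothetical toy values, not data from the paper.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def idw_predict(target, observed, power=2.0):
    """Estimate a score at `target` from `observed`, a list of ((lat, lon), score).

    Each observed site is weighted by 1 / distance**power, so nearby sites
    dominate the estimate. If the target coincides with an observed site,
    that site's score is returned directly.
    """
    num = 0.0
    den = 0.0
    for coords, score in observed:
        d = haversine_km(target, coords)
        if d == 0.0:
            return score
        w = 1.0 / d ** power
        num += w * score
        den += w
    return num / den


# Hypothetical sites with made-up word error rates (assumptions, not paper data).
toy_sites = [
    ((45.4, 9.2), 0.55),
    ((41.9, 12.5), 0.30),
    ((40.8, 14.3), 0.48),
]

# Estimated error rate at a site with no evaluation data.
estimate = idw_predict((43.8, 11.2), toy_sites)
```

IDW always returns a value between the lowest and highest observed score, which makes it easy to sanity-check. A kriging-style model would additionally fit a spatial covariance structure from the data and could report uncertainty for each prediction.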
Abstract
There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to…
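As an illustration of how a correlation like the one reported above could be computed, the sketch below implements a Pearson correlation between per-site error rates and per-site linguistic similarity scores. The abstract does not say which correlation coefficient was used, so Pearson is an assumption here, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)
```

Called as `pearson(error_rates, similarities)` on per-site measurements, a negative result would indicate that sites more similar to the best-performing variety tend to have lower error rates.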
Taxonomy
Topics: Linguistic Studies and Language Acquisition · Linguistic research and analysis · Linguistic Education and Pedagogy
