Modeling Orthographic Variation in Occitan's Dialects
Zachary William Hopton (Language and Space Lab, University of Zurich) and Noëmi Aepli (Department of Computational Linguistics, University of Zurich)

TL;DR
This paper shows that a large multilingual model fine-tuned on Occitan data handles dialectal variation robustly, which reduces the need for explicit spelling normalization when processing low-resource dialectal languages.
Contribution
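It fine-tunes a multilingual model on data from several Occitan dialects, compiles a parallel lexicon covering four dialects for evaluation, and shows that the model performs part-of-speech tagging and Universal Dependency parsing robustly across dialects with minimal preprocessing.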
Findings
Model embeddings reflect surface similarity among dialects.
Fine-tuned models perform well on POS tagging and parsing across dialects.
Multilingual models reduce the need for spelling normalization.
Abstract
Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
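The intrinsic evaluation described in the abstract relates the surface similarity of parallel dialect forms to the similarity of their embeddings. The following is a minimal sketch of that kind of comparison, not the authors' procedure. Everything in it is an assumption made for illustration: the three Lengadocian/Gascon word pairs are toy entries rather than items from the paper's lexicon, normalized edit distance stands in for "surface similarity", cosine similarity is an assumed measure of embedding closeness, and `toy_embed` is a hypothetical character-bigram stand-in for the fine-tuned model's embeddings.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """1.0 for identical spellings, falling toward 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def toy_embed(word: str, n: int = 2) -> Counter:
    """Hypothetical stand-in embedder: a bag of character n-grams.

    A real experiment would return a vector taken from the fine-tuned
    model here instead.
    """
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Illustrative toy pairs (Lengadocian form, Gascon form); NOT the paper's data.
LEXICON = [("filha", "hilha"), ("farina", "haria"), ("luna", "lua")]


def compare_pairs(pairs, embed=toy_embed):
    """Return (form_a, form_b, surface similarity, embedding cosine) per pair."""
    return [(a, b, surface_similarity(a, b), cosine(embed(a), embed(b)))
            for a, b in pairs]


for a, b, surf, cos in compare_pairs(LEXICON):
    print(f"{a:>8} ~ {b:<8} surface={surf:.2f} cosine={cos:.2f}")
```

To run the same comparison against a real model, one would pass a different `embed` function to `compare_pairs` and adapt `cosine` to dense vectors; the character-bigram version is used here only so the sketch runs without external dependencies.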
Taxonomy
Topics: Medieval European Literature and History · Basque language and culture studies · Linguistic Variation and Morphology
