A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations
Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

TL;DR
This paper compiles a comprehensive catalog of Basque dialectal resources, including online data and standard-to-dialect adaptations, to support dialectal NLP research and address data scarcity.
Contribution
The paper introduces a systematic catalog of Basque dialectal data, covering both manually and automatically adapted datasets, together with quality assessments and new parallel evaluation datasets.
Findings
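Created a high-quality parallel gold-standard dataset for Basque dialects by manually adapting the XNLI test split.
Evaluated the quality of the automatically adapted dialectal data.
Compiled the dialectal resources available online, including news and radio sites, informal tweets, dictionaries, atlases, grammar rules, and videos.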
Abstract
Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three…
Taxonomy
Topics: Basque language and culture studies · Linguistic Variation and Morphology · Natural Language Processing Techniques
