EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching
Maite Heredia, Jeremy Barnes, Aitor Soroa

TL;DR
This paper introduces EuskañolDS, a corpus of naturally occurring Basque-Spanish code-switching data, created through a combination of automated identification and manual validation, to support NLP research on this language pair.
Contribution
It presents the first naturally sourced Basque-Spanish code-switching corpus, including methodology for data collection and validation, filling a critical resource gap in NLP research.
Findings
Corpus contains X number of CS instances.
Methodology achieves Y% accuracy in identifying CS texts.
Corpus is publicly available for research use.
Abstract
Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to developing a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the…
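As a rough illustration of the identification step described in the abstract, the sketch below flags a text as a CS candidate when it contains enough tokens from both languages. It is not the authors' pipeline: the paper uses language identification models, whereas the tiny word lists `BASQUE` and `SPANISH`, the helper names, and the `min_tokens` threshold here are hypothetical stand-ins chosen only to keep the example self-contained. Candidates found this way would still need the manual validation the abstract mentions.

```python
import re

# Hypothetical toy lexicons (assumption): a real pipeline would call a trained
# language identification model instead of looking words up in fixed sets.
BASQUE = {"eta", "da", "ez", "bai", "naiz", "gaur", "etxera", "oso",
          "nekatuta", "kaixo", "zer", "moduz"}
SPANISH = {"y", "es", "no", "sí", "pero", "estoy", "muy", "cansado",
           "hola", "que", "tal", "porque", "mañana"}


def token_languages(text):
    """Label each word token as 'eu' (Basque), 'es' (Spanish), or 'unk'."""
    labels = []
    for token in re.findall(r"\w+", text.lower()):
        if token in BASQUE:
            labels.append("eu")
        elif token in SPANISH:
            labels.append("es")
        else:
            labels.append("unk")
    return labels


def is_cs_candidate(text, min_tokens=2):
    """Return True if both languages contribute at least `min_tokens` tokens."""
    labels = token_languages(text)
    return labels.count("eu") >= min_tokens and labels.count("es") >= min_tokens


def filter_candidates(texts, min_tokens=2):
    """Keep only the texts flagged as CS candidates, for later manual review."""
    return [t for t in texts if is_cs_candidate(t, min_tokens)]
```

Requiring a minimum number of tokens per language, rather than a single hit, is one simple way to avoid flagging monolingual texts that contain an isolated borrowing; the threshold value is an assumption and would need tuning against validated data.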
Taxonomy
Topics: Natural Language Processing Techniques · Basque language and culture studies · Spanish Linguistics and Language Studies
