TL;DR
This paper investigates how language similarity and script differences interact in cross-lingual transfer for NLP tasks, using a new corpus of Algerian dialect written in Latin, Arabic, and code-switched scripts. The relationship turns out to be complex, especially for POS tagging.
Contribution
It introduces a new multi-layer corpus of Algerian user-generated comments, with parallel annotations across Latin, Arabic, and code-switched scripts plus sentiment and topic labels, and explores how script and typological similarity affect transfer performance.
Findings
Script and typology influence POS transfer differently
Sentiment analysis is less affected by script and typology differences
Fine-tuning multilingual models reveals nuanced effects of language and script
Abstract
Recent years have seen a rise in interest in cross-lingual transfer between languages with similar typology, and between languages written in different scripts. However, the interplay between language similarity and difference in script in cross-lingual transfer is a less studied problem. We explore this interplay in cross-lingual transfer for two supervised tasks, namely part-of-speech tagging and sentiment analysis. We introduce a newly annotated corpus of Algerian user-generated comments comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts, as well as annotations for sentiment and topic categories. We perform baseline experiments by fine-tuning multi-lingual language models. We further explore the effect of script vs. language similarity in cross-lingual transfer by fine-tuning multi-lingual models on languages which are a) typologically distinct,…
