TL;DR
This paper introduces the novel task of Micro-Dialect Identification (MDI) in diglossic and code-switched environments and proposes MARBERT, a language model that predicts fine-grained dialects (as small as that of a city) from a single short message, demonstrated on a new Arabic dataset.
Contribution
The paper presents MARBERT, a new language model designed for micro-dialect prediction, along with a large-scale Arabic micro-varieties dataset, advancing fine-grained dialect identification.
Findings
MARBERT predicts micro-dialects with 9.9% F1, roughly 76 times better than a majority-class baseline.
Introduces a new large-scale Arabic micro-varieties dataset.
Establishes new state-of-the-art on multiple external tasks.
Abstract
Although the prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties. Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (as small as that of a city) given a single, short message. For modeling, we offer a range of novel spatially and linguistically-motivated multi-task learning models. To showcase the utility of our models, we introduce a new, large-scale dataset of Arabic micro-varieties (low-resource) suited to our tasks. MARBERT predicts micro-dialects with 9.9% F1, ~76X better than a majority class baseline. Our new language model also establishes new state-of-the-art on several external tasks.
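To put the headline number in context, the sketch below shows how a majority-class baseline is typically scored and what baseline F1 the "~76X" figure implies. It is a minimal illustration, not the authors' evaluation code: the macro-averaging of F1, the helper names `macro_f1` and `majority_baseline`, and the toy city labels are all assumptions made for this example.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline(train_labels, n):
    """Predict the most frequent training label for every one of n inputs."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Hypothetical toy labels standing in for city-level micro-dialects.
gold = ["cairo", "cairo", "riyadh", "tunis", "cairo", "rabat"]
pred = majority_baseline(gold, len(gold))
baseline_f1 = macro_f1(gold, pred)

# Baseline F1 (in percent) implied by the reported figures: 9.9% F1 at ~76X.
implied_baseline = 9.9 / 76

print(f"toy majority-class macro F1: {baseline_f1:.3f}")
print(f"implied baseline F1: {implied_baseline:.2f}%")
```

With many fine-grained classes, a majority-class predictor scores zero on every class but one, so its averaged F1 shrinks as the label set grows. That is why 9.9% F1 can still represent a large gain: dividing it by 76 gives an implied baseline of only about 0.13% F1, approximate because the reported multiplier is itself rounded.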
