Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments

Muhammad Abdul-Mageed; Chiyu Zhang; AbdelRahim Elmadany; Lyle Ungar

arXiv:2010.04900 · cs.CL · December 8, 2020


TL;DR

This paper introduces the task of Micro-Dialect Identification (MDI) in diaglossic and code-switched environments and proposes MARBERT, a language model that predicts fine-grained dialects (as small as that of a city) from a single short message, evaluated on a new Arabic dataset.

Contribution

The paper presents MARBERT, a new language model designed for micro-dialect prediction, along with a large-scale Arabic micro-varieties dataset, advancing fine-grained dialect identification.

Findings

01

MARBERT predicts micro-dialects with 9.9% F1, roughly 76 times the score of a majority-class baseline.

02

Introduces a new large-scale Arabic micro-varieties dataset.

03

Establishes a new state of the art on several external tasks.

Abstract

Although the prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties. Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (as small as that of a city) given a single, short message. For modeling, we offer a range of novel spatially and linguistically-motivated multi-task learning models. To showcase the utility of our models, we introduce a new, large-scale dataset of Arabic micro-varieties (low-resource) suited to our tasks. MARBERT predicts micro-dialects with 9.9% F1, ~76X better than a majority class baseline. Our new language model also establishes new state-of-the-art on several external tasks.
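To put the "~76X better than a majority class baseline" figure in context, the sketch below shows how a majority-class baseline could be scored on a many-class task. It is only an illustration: the abstract does not say how F1 is averaged, so macro-averaging is an assumption here, and the city labels, `macro_f1`, and `majority_baseline` are hypothetical names and toy data, not taken from the paper or its repository.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes seen in gold.

    Assumption: macro-averaging; the abstract does not specify the averaging.
    """
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline(train_labels, n):
    """Predict the most frequent training label for every one of n inputs."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Hypothetical toy data: one city label per message.
gold = ["cairo", "cairo", "riyadh", "rabat", "doha", "cairo"]
pred = majority_baseline(gold, len(gold))
baseline_f1 = macro_f1(gold, pred)

# Back-of-the-envelope reading of the abstract's own numbers:
# 9.9% F1 at ~76X the baseline implies a baseline of about 9.9 / 76 percent.
implied_baseline_pct = 9.9 / 76

print(f"toy majority-class macro F1: {baseline_f1:.4f}")
print(f"implied baseline F1 from the abstract: {implied_baseline_pct:.2f}%")
```

Because the baseline gets credit for only one class, its macro F1 shrinks as the number of classes grows, which is why a seemingly low absolute score such as 9.9% can still be a large multiple of the baseline when the label set is very fine-grained.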


Code & Models

Repositories

UBC-NLP/microdialects
Official
