Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification
Mansour Essgaer, Khamis Massud, Rabia Al Mamlook, Najah Ghmaid

TL;DR
This paper evaluates four machine learning classifiers for identifying the Libyan dialect in Twitter data. Multinomial Naive Bayes with (1,2) word and (1,5) character n-gram features achieves the highest accuracy, 85.89%, which gives an empirical benchmark for Arabic dialect identification.
Contribution
The paper presents an empirical comparison of classifiers and feature representations for Libyan dialect identification, and shows that Multinomial Naive Bayes with n-gram features is the most effective combination tested.
Findings
Multinomial Naive Bayes (MNB) achieved 85.89% accuracy.
(1,2) word n-grams and (1,5) character n-grams were the most effective feature representations.
MNB outperformed Logistic Regression and the linear SVM.
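The feature setup named in the findings can be illustrated with a small sketch. It assumes that "(1,2)" and "(1,5)" denote inclusive minimum and maximum n-gram lengths (the convention used, for example, by scikit-learn's `ngram_range` parameter) and that word and character n-grams are pooled into one feature set; the summary does not confirm either point. The class `TinyMultinomialNB`, the helper names, and the training strings are hypothetical placeholders, not the authors' code or QADI data.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_min=1, n_max=2):
    """Word n-grams for every n in [n_min, n_max], split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=1, n_max=5):
    """Character n-grams for every n in [n_min, n_max], spaces included."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def extract_features(text):
    # Assumption: both feature families are pooled; prefixes keep them apart.
    return (["w:" + g for g in word_ngrams(text)]
            + ["c:" + g for g in char_ngrams(text)])


class TinyMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = extract_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        self.totals = {c: sum(self.feature_counts[c].values())
                       for c in self.class_counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # Features never seen in training are ignored at prediction time.
        feats = [f for f in extract_features(text) if f in self.vocab]
        best_label, best_score = None, -math.inf
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / n_docs)  # log prior
            denom = self.totals[c] + self.alpha * vocab_size
            for f in feats:
                score += math.log((self.feature_counts[c][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Placeholder strings only; these are not real dialect samples.
toy_texts = ["foo bar baz", "qux quux corge"]
toy_labels = ["dialect_a", "dialect_b"]
model = TinyMultinomialNB().fit(toy_texts, toy_labels)
print(word_ngrams("a b c"))      # ['a', 'b', 'c', 'a b', 'b c']
print(model.predict("foo bar"))
```

Character n-grams up to length 5 capture sub-word spelling patterns, which is one plausible reason they help with the inconsistent orthography the abstract mentions; the sketch does not attempt to reproduce the reported 85.89% accuracy.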
Abstract
This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification…
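The chi-square screening of meta-features described in the abstract can be sketched as a test of independence on a contingency table. The example below assumes the simplest case, a 2x2 table (meta-feature present or absent, crossed with Libyan or other dialect); the actual table shape used in the paper is not stated in this summary, and the counts are invented for illustration. The function name `chi_square_2x2` is hypothetical. No continuity correction is applied, and the p-value uses the closed form for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    table[i][j] = number of tweets with feature state i and class j.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X >= stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: rows = feature present / absent, cols = Libyan / other.
stat, p = chi_square_2x2([[30, 28], [470, 472]])
keep_feature = p < 0.05  # a conventional threshold, assumed here
print(round(stat, 4), round(p, 4), keep_feature)
```

A feature whose p-value stays above the chosen threshold shows no significant association with the class label, which matches the abstract's rationale for excluding such features from further analysis.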
Taxonomy
Topics: Linguistic Variation and Morphology · Authorship Attribution and Profiling · Natural Language Processing Techniques
