USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification
Mohamed Lichouri, Khaled Lounnas, Aicha Zitouni, Houda Latrache, Rachida Djeradi

TL;DR
This paper analyzes preprocessing and feature engineering techniques for Arabic dialect identification, achieving an F1 score of 62.51%, and compares their impact on classification performance.
Contribution
It systematically evaluates the effects of various preprocessing and feature strategies on Arabic dialect identification performance.
Findings
Preprocessing and feature engineering significantly influence classification accuracy.
The system achieved an F1 score of 62.51% on the first subtask, roughly 10 points below the 72.91% average of the submitted systems.
Different feature combinations impact model performance variably.
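As an illustration of how feature combinations can be varied, the following is a minimal sketch of a weighted concatenation of TF-IDF features feeding a linear SVM, using scikit-learn. The paper names TF-IDF, weighted concatenation, and Linear Support Vector Classification; everything else here is an assumption made for illustration: the function name build_classifier, the split into word and character n-gram views, the n-gram ranges, and the default weights. The FastText features mentioned in the paper are not included.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_classifier(word_weight=1.0, char_weight=0.5):
    """Build a TF-IDF + LinearSVC pipeline (hypothetical helper).

    The two TF-IDF blocks are concatenated column-wise, and each block is
    multiplied by its weight before concatenation. The weights and n-gram
    ranges are illustrative assumptions, not values from the paper.
    """
    features = FeatureUnion(
        transformer_list=[
            # Word-level unigrams and bigrams.
            ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
            # Character n-grams restricted to word boundaries.
            ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
        ],
        # Per-block scaling applied to each transformer's output.
        transformer_weights={"word": word_weight, "char": char_weight},
    )
    return Pipeline([("features", features), ("clf", LinearSVC())])
```

Calling build_classifier().fit(texts, labels) on a list of tweets and their country labels trains the model, and predict(texts) returns one label per input. Changing word_weight and char_weight shifts the relative influence of each feature block on the linear classifier, which is one way to compare feature combinations.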
Abstract
In this paper, we conduct an in-depth analysis of several key factors influencing the performance of Arabic dialect identification in NADI'2023, with a specific focus on the first subtask involving country-level dialect identification. Our investigation encompasses the effects of surface preprocessing, morphological preprocessing, the FastText vector model, and the weighted concatenation of TF-IDF features. For classification purposes, we employ the Linear Support Vector Classification (LSVC) model. During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%. This achievement closely aligns with the average F1 score attained by other systems submitted for the first subtask, which stands at 72.91%.
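To make "surface preprocessing" concrete, here is a minimal sketch of the kind of normalization commonly applied to Arabic social media text before feature extraction. The abstract does not list the exact steps used, so every rule below is an assumption chosen for illustration: removing URLs and mentions, stripping diacritics and tatweel, unifying alef variants, and collapsing elongated characters. The function name surface_normalize is hypothetical.

```python
import re

# Arabic short-vowel marks (U+064B to U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely typographic elongation character.
_TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below, mapped to bare alef.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
_BARE_ALEF = "\u0627"
# URLs and @mentions, which carry little dialectal signal.
_URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")


def surface_normalize(text: str) -> str:
    """Apply illustrative surface-level cleaning to one text (assumed steps)."""
    text = _URL_OR_MENTION.sub(" ", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub(_BARE_ALEF, text)
    # Collapse elongations such as a letter repeated five times down to two.
    text = _REPEATS.sub(r"\1\1", text)
    # Squeeze runs of whitespace and trim the ends.
    return " ".join(text.split())
```

Whether each of these steps helps is an empirical question of the sort the paper investigates; some of them, such as collapsing elongations, may remove cues that are informative for dialect identification.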
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Speech Recognition and Synthesis
Methods: Focus · fastText
