Automatic Dialect Detection in Arabic Broadcast Speech

Ahmed Ali; Najim Dehak; Patrick Cardinal; Sameer Khurana; Sree Harsha Yella; James Glass; Peter Bell; Steve Renals

arXiv:1509.06928·cs.CL·August 12, 2016

TL;DR

This paper compares phonetic, lexical, and acoustic (i-vector) features for Arabic dialect identification. It reports 100% accuracy on two binary tasks (Arabic vs. English, and MSA vs. Dialectal Arabic) and 52% accuracy on five-way dialect classification, and provides a new dataset for future research.

Contribution

It combines phonetic, lexical, and i-vector acoustic features with generative and discriminative classifiers for Arabic dialect detection, and releases a standard dataset for benchmarking.

Findings

01 100% accuracy in Arabic/English language identification

02 100% accuracy in distinguishing MSA from Dialectal Arabic

03 52% accuracy in classifying five Arabic dialects

Abstract

We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic and lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/English language identification task, with an accuracy of 100%. We used these features in a binary classifier to discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with an accuracy of 100%. We further report results using the proposed method to discriminate between the five most widely used dialects of Arabic: namely Egyptian, Gulf, Levantine, North African, and MSA, with an accuracy of 52%. We discuss dialect identification errors in the context of dialect code-switching…
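The abstract's idea of combining several feature types in one multi-class SVM can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the combination is a plain concatenation of per-utterance vectors (the paper may combine features differently), it uses a one-vs-rest linear SVM trained by stochastic subgradient descent on the hinge loss, and it substitutes synthetic vectors for the real acoustic and phonetic/lexical features. All function names (`concat_views`, `train_ovr_svm`, `make_view`, `demo`) are hypothetical.

```python
import random

def concat_views(*views):
    """Join per-utterance vectors from several feature views into one vector each."""
    return [[x for view in views for x in view[i]] for i in range(len(views[0]))]

def score(w, x):
    """Linear score w.x + b, with the bias stored as the last weight."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def train_ovr_svm(X, y, n_classes, epochs=30, lr=0.1, lam=1e-3, seed=0):
    """One-vs-rest linear SVM: one hinge-loss classifier per class (assumed setup)."""
    rng = random.Random(seed)
    W = [[0.0] * (len(X[0]) + 1) for _ in range(n_classes)]
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            for c in range(n_classes):
                t = 1.0 if y[i] == c else -1.0
                violated = t * score(W[c], X[i]) < 1.0
                # L2 shrinkage on every step
                W[c] = [w * (1.0 - lr * lam) for w in W[c]]
                if violated:
                    # Hinge-loss subgradient step for a margin violation
                    for j, xj in enumerate(X[i]):
                        W[c][j] += lr * t * xj
                    W[c][-1] += lr * t
    return W

def predict(W, x):
    """Pick the class whose classifier gives the highest score."""
    return max(range(len(W)), key=lambda c: score(W[c], x))

def make_view(labels, n_classes, rng, noise=0.2):
    """Synthetic stand-in for one feature view: class c is centred on 2 * e_c."""
    return [[(2.0 if d == c else 0.0) + rng.gauss(0.0, noise)
             for d in range(n_classes)] for c in labels]

def demo(seed=0):
    """Train on two synthetic views and return held-out accuracy."""
    rng = random.Random(seed)
    dialects = ["Egyptian", "Gulf", "Levantine", "North African", "MSA"]
    n = len(dialects)
    train_y = [c for c in range(n) for _ in range(20)]
    test_y = [c for c in range(n) for _ in range(10)]
    # Two views play the role of, e.g., acoustic and phonetic/lexical features.
    train_X = concat_views(make_view(train_y, n, rng), make_view(train_y, n, rng))
    test_X = concat_views(make_view(test_y, n, rng), make_view(test_y, n, rng))
    W = train_ovr_svm(train_X, train_y, n)
    hits = sum(predict(W, x) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)

if __name__ == "__main__":
    print(f"held-out accuracy on synthetic data: {demo():.2f}")
```

The synthetic classes are well separated by construction, so the accuracy printed here says nothing about the 52% five-dialect result reported in the abstract; the sketch only shows the mechanics of fusing feature views before a multi-class SVM.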

Code & Models

Repositories

Qatar-Computing-Research-Institute/dialectID
Official
