Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
Salmane Chafik, Saad Ezzini, Ismail Berrada

TL;DR
This paper introduces Dialect2SQL, a large-scale dataset for translating Moroccan Arabic dialect questions into SQL queries, addressing challenges in low-resource language scenarios and supporting cross-domain applications.
Contribution
It presents the first extensive Arabic dialect text-to-SQL dataset, capturing dialect-specific complexities and diverse domain coverage to advance NLP models for low-resource languages.
Findings
The dataset contains 9,428 NLQ-SQL pairs across 69 databases.
It addresses dialect-specific linguistic challenges in text-to-SQL translation.
It supports the development of NLP models for low-resource Arabic dialects.
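To make the idea of an NLQ-SQL pair concrete, here is a minimal Python sketch of how one such pair could be represented and checked by comparing query results. Everything in it is an assumption for illustration: the `Example` fields, the toy `students` table, the placeholder question, and the execution-match check are not taken from Dialect2SQL, whose actual schema and evaluation protocol may differ.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Example:
    """One hypothetical NLQ-SQL pair (field names are assumptions)."""
    db_id: str      # which database the question targets
    question: str   # natural language question (Darija in the real dataset)
    sql: str        # reference ("gold") SQL query


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a query and return its rows in an order-independent form."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries run and yield the same set of rows."""
    try:
        return run_query(conn, predicted_sql) == run_query(conn, gold_sql)
    except sqlite3.Error:
        # A prediction that fails to execute counts as a mismatch.
        return False


# Toy in-memory database standing in for one of the dataset's databases.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO students VALUES
        (1, 'Amina', 'Rabat'),
        (2, 'Youssef', 'Fes'),
        (3, 'Salma', 'Rabat');
    """
)

example = Example(
    db_id="toy_school",
    question="<question written in Moroccan Darija>",  # placeholder, not real data
    sql="SELECT COUNT(*) FROM students WHERE city = 'Rabat'",
)

# A hypothetical model prediction that is worded differently but equivalent.
predicted = "SELECT COUNT(id) FROM students WHERE city = 'Rabat'"
is_correct = execution_match(conn, predicted, example.sql)
```

Comparing executed results rather than query strings lets differently written but equivalent queries count as correct, which is one common way text-to-SQL predictions are scored.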
Abstract
The task of converting natural language questions (NLQs) into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also…
Taxonomy
Topics: Linguistic Variation and Morphology · Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques
