GRDD: A Dataset for Greek Dialectal NLP

Stergios Chatzikyriakidis; Chatrine Qwaider; Ilias Kolokousis; Christina Koula; Dimitris Papadakis; Efthymia Sakellariou

arXiv:2308.00802 · cs.CL · October 15, 2025

TL;DR

This paper introduces a large-scale raw-text dataset covering four Modern Greek dialects (Cretan, Pontic, Northern Greek, and Cypriot Greek). Simple machine learning models identify the dialects with high accuracy, and the error analysis highlights the importance of data cleaning.

Contribution

The paper presents the first attempt at a large-scale dialectal text resource of this type for Modern Greek, and shows that traditional ML algorithms and simple DL architectures identify the dialects effectively.

Findings

1. High performance in dialect identification with simple ML models
2. Dataset reveals distinct dialectal features
3. Errors often due to dataset cleaning issues

Abstract

In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, suggesting that the dialects in question have characteristics distinct enough for even simple ML models to perform well. Error analysis of the top-performing algorithms shows that in a number of cases the errors are due to insufficient dataset cleaning.
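
To make the dialect identification task concrete, the following is a minimal sketch of one traditional ML approach: a multinomial Naive Bayes classifier over character trigrams, written with the Python standard library only. It is an illustration under stated assumptions, not the authors' pipeline. The abstract does not say which features or algorithms were used, so the choice of character n-grams, the class name `NgramNaiveBayes`, and the labels `variety_a` and `variety_b` are all hypothetical. The training strings are synthetic placeholders and are not taken from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical helper for illustration; not part of the paper or its repository.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior from the (possibly imbalanced) label distribution.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in char_ngrams(text, self.n):
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder data: made-up strings, not real dialect text.
train_texts = [
    "mira zash tolo zash",
    "zash pelo mira",
    "mira kov tolo kov",
    "kov pelo mira",
]
train_labels = ["variety_a", "variety_a", "variety_b", "variety_b"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("tolo zash pelo"))
print(model.predict("mira kov"))
```

Character n-grams are a common choice for this kind of task because they pick up spelling and morphological cues without a tokenizer. Because the dataset is described as imbalanced, the log prior in `predict` will favor the larger classes, so a real experiment would likely report per-class metrics rather than accuracy alone.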

Code & Models

Repositories

stergioscha/greek_dialect_corpus (official)

Taxonomy

Topics: Linguistic Variation and Morphology · Natural Language Processing Techniques · Gender Studies in Language