RoDia: A New Dataset for Romanian Dialect Identification from Speech

Codrut Rotaru; Nicolae-Catalin Ristea; Radu Tudor Ionescu

arXiv:2309.03378 · cs.CL · March 22, 2024

TL;DR

RoDia is the first dataset for Romanian dialect identification from speech. It covers five regions of Romania, totals 2 hours of manually annotated speech, and comes with baseline models for future research on this challenging task.

Contribution

The paper introduces RoDia, the first dataset for Romanian dialect identification from speech, together with a set of competitive baseline models to support future research on dialect identification.

Findings

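01. The top-scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%.

02. These moderate scores indicate that the task remains challenging.

03. The dataset covers five distinct regions of Romania, with speech samples from both urban and rural environments.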

Abstract

We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
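The abstract reports both a macro and a micro F1 score. As a minimal sketch of how the two differ on a five-class, single-label task, the snippet below computes both from scratch. The function name `f1_scores`, the class labels `A` to `E`, and the toy predictions are hypothetical placeholders, not RoDia data, and the authors' evaluation code may differ.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Macro F1: compute F1 per class, then take the unweighted mean,
    # so every class counts equally regardless of its size.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)

    # Micro F1: pool the counts over all classes, then compute one F1,
    # so frequent classes dominate the score.
    total_tp = sum(tp[c] for c in labels)
    total_fp = sum(fp[c] for c in labels)
    total_fn = sum(fn[c] for c in labels)
    denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / denom if denom else 0.0
    return macro, micro


# Hypothetical labels standing in for five dialect regions (not RoDia data).
labels = ["A", "B", "C", "D", "E"]
y_true = ["A", "A", "A", "A", "B", "B", "C", "D", "E", "E"]
y_pred = ["A", "A", "A", "B", "B", "C", "C", "D", "E", "A"]

macro, micro = f1_scores(y_true, y_pred, labels)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

When every sample has exactly one true label and one predicted label, micro F1 equals plain accuracy, while macro F1 weights each class equally. A gap between the two scores, as in the reported 59.83% versus 62.08%, is consistent with performance that varies across classes.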

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis