RoDia: A New Dataset for Romanian Dialect Identification from Speech
Codrut Rotaru, Nicolae-Catalin Ristea, Radu Tudor Ionescu

TL;DR
RoDia is the first dataset for Romanian dialect identification from speech. It covers five regions of Romania and ships with baseline models intended to advance research on this challenging task.
Contribution
Introduction of RoDia, the first comprehensive Romanian dialect speech dataset, along with baseline models to facilitate future dialect identification research.
Findings
Top model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%
Task remains challenging: even the best baseline stays close to 60% on both metrics
Dataset covers five Romanian regions, in both urban and rural environments, with 2 hours of manually annotated speech
Abstract
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
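The abstract reports both macro and micro F1, and the two can differ noticeably when classes are unevenly represented. The following is a minimal sketch of how the two averages are computed for a five-class, single-label task. It is not the authors' evaluation code: the region labels (`region_1` to `region_5`) and the toy predictions are hypothetical placeholders, not RoDia data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels (NOT taken from RoDia); five placeholder regions.
y_true = ["region_1", "region_1", "region_1", "region_1", "region_2",
          "region_2", "region_3", "region_4", "region_5", "region_5"]
y_pred = ["region_1", "region_1", "region_1", "region_2", "region_2",
          "region_1", "region_3", "region_5", "region_5", "region_5"]

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.2%}, micro F1 = {micro:.2%}")
# prints "macro F1 = 61.00%, micro F1 = 70.00%"
```

In this toy example the single `region_4` sample is misclassified, which drags the macro average down because every class counts equally, while the micro average (equal to plain accuracy in the single-label setting) is less affected. A gap in the same direction appears in the reported scores, where macro F1 is lower than micro F1.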
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
