A Survey on Spoken Italian Datasets and Corpora

Marco Giordano; Claudia Rinaldi

arXiv:2501.06557 · cs.CL · March 13, 2025

TL;DR

This survey comprehensively reviews 66 spoken Italian datasets, analyzing their features, applications, and challenges, to support future research and development in Italian speech technology and linguistic studies.

Contribution

The survey provides the first detailed categorization and analysis of spoken Italian datasets, identifies gaps in existing resources, and offers recommendations for future dataset creation and use.

Findings

01 Identified key characteristics and applications of Italian speech datasets

02 Highlighted challenges of dataset scarcity and accessibility

03 Provided a publicly accessible dataset inventory

Abstract

Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to those for major languages such as English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on…
