A Survey on Spoken Italian Datasets and Corpora
Marco Giordano, Claudia Rinaldi

TL;DR
This survey reviews 66 spoken Italian datasets, analyzing their features, applications, and challenges to support future research and development in Italian speech technology and linguistic studies.
Contribution
The survey provides the first detailed categorization and analysis of spoken Italian datasets, highlights gaps, and offers recommendations for future dataset creation and use.
Findings
Identified key characteristics and applications of Italian speech datasets
Highlighted challenges of dataset scarcity and accessibility
Provided a publicly accessible dataset inventory
Abstract
Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared with those for major languages such as English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on…
