Quantifying Language Variation Acoustically with Few Resources
Martijn Bartelds, Martijn Wieling

TL;DR
This paper shows that deep acoustic models can distinguish regional dialects from very little speech data, outperforming transcription-based methods without requiring any phonetic transcriptions.
Contribution
It shows that pre-trained and fine-tuned wav2vec 2.0 models can quantify language variation acoustically from very limited data, which matters most for low-resource dialects.
Findings
Acoustic models outperform transcription-based approaches.
The multilingual XLSR-53 model fine-tuned on Dutch yields the best results.
Effective clustering is achieved with only six seconds of speech.
Abstract
Deep acoustic models represent linguistic information based on massive amounts of data. Unfortunately, for regional languages and dialects such resources are mostly not available. However, deep acoustic models might have learned linguistic information that transfers to low-resource languages. In this study, we evaluate whether this is the case through the task of distinguishing low-resource (Dutch) regional varieties. By extracting embeddings from the hidden layers of various wav2vec 2.0 models (including new models which are pre-trained and/or fine-tuned on Dutch) and using dynamic time warping, we compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages. We then cluster the resulting difference matrix in four groups and compare these to a gold standard, and a partitioning on the basis of comparing phonetic…
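The pipeline outlined in the abstract (frame-level embeddings per word, dynamic time warping between recordings of the same word, and averaging over words to obtain a dialect-by-dialect difference matrix) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The Euclidean frame distance, the normalisation by the summed sequence lengths, the helper names, and the toy two-dimensional "embeddings" are all assumptions made for the example; in practice the frames would be hidden-layer outputs of a wav2vec 2.0 model.

```python
import math
from itertools import combinations


def frame_distance(a, b):
    """Euclidean distance between two embedding frames (assumed metric)."""
    return math.dist(a, b)


def dtw_distance(seq_a, seq_b):
    """DTW cost between two frame sequences, normalised by their summed
    lengths (the normalisation is an assumption of this sketch)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i and first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame in seq_b
                                 cost[i][j - 1],      # skip a frame in seq_a
                                 cost[i - 1][j - 1])  # align both frames
    return cost[n][m] / (n + m)


def dialect_difference(words_a, words_b):
    """Average DTW distance over the words recorded in both dialects."""
    shared = sorted(set(words_a) & set(words_b))
    total = sum(dtw_distance(words_a[w], words_b[w]) for w in shared)
    return total / len(shared)


def difference_matrix(dialects):
    """Symmetric matrix of pairwise dialect differences (nested dicts)."""
    names = sorted(dialects)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = dialect_difference(dialects[a], dialects[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical toy data: dialect -> word -> list of embedding frames.
toy = {
    "dialect_A": {"word1": [[0.0, 0.0], [1.0, 0.0]], "word2": [[0.0, 1.0]]},
    "dialect_B": {"word1": [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
                  "word2": [[0.0, 1.0]]},
    "dialect_C": {"word1": [[5.0, 5.0], [6.0, 5.0]], "word2": [[5.0, 6.0]]},
}
matrix = difference_matrix(toy)
```

In the toy data, dialect_B differs from dialect_A only by a repeated frame, which DTW absorbs, so the two come out as identical, while dialect_C is clearly separated from both. The resulting matrix is the kind of input that would then be passed to a clustering step.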
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Natural Language Processing Techniques
