TL;DR
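This study tests whether pre-training self-supervised Wav2Vec2 speech models exclusively on Dutch improves how well their internal representations encode Dutch phonetic and lexical information. Dutch pre-training yields better representations of Dutch linguistic features than pre-training on similar amounts of English or larger amounts of multilingual data, an advantage the study relates to speech recognition performance.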
Contribution
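The paper shows that pre-training exclusively on Dutch improves the encoding of Dutch linguistic features in self-supervised Wav2Vec2 models, relative to pre-training on similar amounts of English or larger amounts of multilingual data. Trained clustering or classification probes detect this language-specific advantage well, while zero-shot metrics reveal it only partially.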
Findings
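Pre-training exclusively on Dutch improves the encoding of Dutch phonetic and lexical information.
Dutch pre-training outperforms pre-training on similar amounts of English or larger amounts of multilingual data when encoding Dutch linguistic features.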
Improved encoding correlates with better speech recognition performance.
Abstract
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding…
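As a rough illustration of what a trained classification probe measures, here is a toy sketch. It is not the authors' probing setup: the features are synthetic Gaussian clusters standing in for Wav2Vec2 hidden states, the nearest-centroid classifier is a deliberately simple substitute for whatever probes the paper uses, and every name (make_features, fit_centroids, probe_accuracy, the separation parameter) is hypothetical. The idea it shows is only that a probe trained on frozen representations scores higher when those representations separate the target classes more clearly.

```python
import random


def make_features(n_per_class, n_classes, dim, separation, rng):
    """Synthetic stand-in for frozen hidden states: one Gaussian cluster per class.

    `separation` (hypothetical knob) controls how far apart the class means are,
    mimicking how strongly a model encodes the probed feature.
    """
    means = [[rng.gauss(0.0, separation) for _ in range(dim)] for _ in range(n_classes)]
    data = []
    for label, mean in enumerate(means):
        for _ in range(n_per_class):
            data.append(([m + rng.gauss(0.0, 1.0) for m in mean], label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Train the probe: average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vec):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], vec)),
    )


def probe_accuracy(data, train_fraction=0.8):
    """Fit the probe on the first part of the data and score it on the held-out rest."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical "models": one encodes the classes strongly, one weakly.
    strong = make_features(250, 4, 8, separation=5.0, rng=rng)
    weak = make_features(250, 4, 8, separation=0.2, rng=rng)
    print("probe accuracy, strongly separated features:", probe_accuracy(strong))
    print("probe accuracy, weakly separated features:  ", probe_accuracy(weak))
```

In a real comparison, the synthetic vectors would be replaced by hidden states extracted from each pre-trained model on the same labeled Dutch data, and the gap in held-out probe accuracy would be read as a difference in how accessibly each model encodes the feature.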
