What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Marianne de Heer Kloots; Hosein Mohebbi; Charlotte Pouw; Gaofei Shen; Willem Zuidema; Martijn Bentum

arXiv:2506.00981 · cs.CL · July 11, 2025

TL;DR

Pre-training self-supervised Wav2Vec2 models exclusively on Dutch improves their encoding of Dutch phonetic and lexical features compared with pre-training on similar amounts of English or larger amounts of multilingual data, and the improved encoding correlates with better speech recognition performance.

Contribution

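It demonstrates that pre-training exclusively on Dutch improves the encoding of Dutch phonetic and lexical features in self-supervised Wav2Vec2 models, relative to pre-training on similar amounts of English or larger amounts of multilingual data. This advantage is well detected by trained clustering or classification probes and is partially observable with zero-shot metrics.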

Findings

01 Pre-training on Dutch improves the encoding of Dutch phonetic and lexical information.

02 Language-specific pre-training outperforms pre-training on similar amounts of English or larger amounts of multilingual data.

03 Improved linguistic feature encoding correlates with better speech recognition performance.

Abstract

How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding…
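
To make the idea of a trained classification probe concrete, here is a minimal sketch in Python. It is not the authors' implementation: the data are synthetic, the nearest-centroid classifier is a deliberately simple stand-in for whatever probes the paper uses, and all function names (`make_synthetic_frames`, `fit_centroids`, `predict`, `probe_accuracy`) are hypothetical. In a real setting, each vector would be a frozen hidden state from one layer of a speech model and each label a linguistic category such as a phone. Higher held-out probe accuracy is then read as evidence that the representation encodes that category more accessibly.

```python
import random


def make_synthetic_frames(n_per_class, dim, separation, seed, n_classes=3):
    """Hypothetical stand-in for frame-level model representations.

    Each class (think: one phone label) is a Gaussian cloud around its own
    mean. Larger `separation` mimics a model that encodes the classes better.
    """
    rng = random.Random(seed)
    data = []
    for label in range(n_classes):
        mean = [separation * rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(n_per_class):
            vec = [m + rng.gauss(0.0, 1.0) for m in mean]
            data.append((vec, label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Train the probe: compute the mean vector of each class."""
    sums, counts = {}, {}
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], vec)),
    )


def probe_accuracy(data, train_fraction=0.8):
    """Fit the probe on a training split and score it on the held-out rest."""
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


# Two synthetic "models": one with well-separated classes, one with none.
well_encoded = make_synthetic_frames(100, 16, separation=5.0, seed=0)
poorly_encoded = make_synthetic_frames(100, 16, separation=0.0, seed=0)
print("probe accuracy, well-separated classes:", probe_accuracy(well_encoded))
print("probe accuracy, overlapping classes:", probe_accuracy(poorly_encoded))
```

The probe is kept intentionally weak: because the underlying representations are frozen and the classifier has little capacity of its own, differences in held-out accuracy are more plausibly attributed to the representations than to the probe.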

Code & Models

Repositories

mdhk/ssl-nl-eval (official)

Models

amsterdamNLP/Wav2Vec2-NL (Hugging Face model; 442 downloads, 1 like)