Self-supervised learning of speech representations with Dutch archival data

Nik Vaessen; Roeland Ordelman; David A. van Leeuwen

arXiv:2507.04554·cs.SD·July 9, 2025

TL;DR

This study uses Dutch archival TV broadcast data to pre-train self-supervised speech models. It examines data quality, pre-processing, and monolingual versus multilingual pre-training, and produces a state-of-the-art Dutch wav2vec 2.0 model.

Contribution

The paper explores pre-processing strategies based on Whisper and WhisperX for noisy broadcast data, and shows that monolingual pre-training is more robust to out-of-domain data than multilingual pre-training with an equivalent amount of data.

Findings

01

Music, noise, and speaker overlap affect SSL convergence and downstream fine-tuning performance.

02

Pre-processing with Whisper improves data quality for SSL.

03

Monolingual pre-training is more robust to out-of-domain data than multilingual pre-training with equivalent amounts of data.

Abstract

This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effective pre-processing strategies to convert the noisy broadcast dataset into a high-quality dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language, by continuing pre-training from a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.
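
To make the pre-processing idea in the abstract more concrete, here is a minimal sketch, assuming a timestamped transcript is already available as (start, end, text) segments. It keeps only regions where speech was transcribed and drops the rest, such as music or silence. The Segment type, the select_speech_spans function, and the max_gap and min_duration thresholds are hypothetical illustrations and are not taken from the paper; the authors' actual Whisper and WhisperX pipeline may differ substantially.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One timestamped transcript segment (hypothetical record layout)."""
    start: float  # seconds from the start of the broadcast
    end: float    # seconds from the start of the broadcast
    text: str     # transcript; empty when no speech was recognised


def select_speech_spans(segments, max_gap=0.5, min_duration=2.0):
    """Return (start, end) spans that contain transcribed speech.

    Segments with empty text are skipped. Speech segments separated by at
    most `max_gap` seconds are merged, and merged spans shorter than
    `min_duration` seconds are discarded. Both thresholds are assumed
    values chosen for illustration only.
    """
    spans = []
    for seg in sorted(segments, key=lambda s: s.start):
        if not seg.text.strip():
            continue  # nothing transcribed: treat as non-speech
        if spans and seg.start - spans[-1][1] <= max_gap:
            # Close enough to the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], seg.end))
        else:
            spans.append((seg.start, seg.end))
    return [(s, e) for s, e in spans if e - s >= min_duration]


segments = [
    Segment(0.0, 1.5, "goedenavond"),
    Segment(1.7, 4.0, "dit is het nieuws"),
    Segment(4.0, 30.0, ""),      # e.g. a music jingle, nothing transcribed
    Segment(30.2, 30.9, "ja"),   # too short to keep on its own
    Segment(45.0, 50.0, "het weer van morgen"),
]
spans = select_speech_spans(segments)
print(spans)
```

In this toy input, the first two segments merge into one span, the untranscribed stretch and the very short utterance are dropped, and the final segment is kept. Detecting speaker overlap or noise would need additional signals that this sketch does not model.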

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Generative Adversarial Networks and Image Synthesis