Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Michele Gubian; Ioana Krehan; Oli Liu; James Kirby; Sharon Goldwater

arXiv:2506.10855 · cs.CL · June 13, 2025

TL;DR

This study investigates how wav2vec2 speech models pretrained on four different languages encode phonetic, tonal, and speaker information. It finds that the structure of the representations is largely independent of the pretraining language, with a relatively small advantage for matched-language phone and tone probes in the later layers.

Contribution

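It extends analyses of self-supervised speech models, which have focused almost entirely on English, to wav2vec2 models pretrained on four different languages. It shows that the subspaces encoding phones, tones, and speakers are largely orthogonal for all pretraining and test languages, and that layerwise probing patterns are similar across them.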

Findings

01

The subspaces encoding phones, tones, and speakers are largely orthogonal.

02

Layerwise probing accuracy patterns are similar across languages.

03

Matched-language phone and tone probes (but not speaker probes) have a relatively small advantage in the later layers.
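To make the idea of layerwise probing concrete, here is a minimal sketch, assuming a simple ridge-regularised linear probe fitted separately on each layer's features. It is not the authors' code: the function names, the probe type, and the synthetic "layers" are all illustrative assumptions, and real features would come from the hidden states of a wav2vec2 model.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares probe onto one-hot labels."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a bias column
    Y = np.eye(n_classes)[y]                     # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)          # weights: (dim + 1, n_classes)

def probe_accuracy(W, X, y):
    """Classification accuracy of a fitted probe on held-out features."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

def layerwise_probe(train_layers, y_train, test_layers, y_test, n_classes):
    """Train one probe per layer and return the list of test accuracies."""
    return [
        probe_accuracy(fit_linear_probe(Xtr, y_train, n_classes), Xte, y_test)
        for Xtr, Xte in zip(train_layers, test_layers)
    ]

# Synthetic stand-in for two layers (assumption: not real model features).
rng = np.random.default_rng(0)
n, dim, n_classes = 600, 16, 3
y = rng.integers(0, n_classes, size=n)
class_means = 3.0 * rng.normal(size=(n_classes, dim))
layers = [
    rng.normal(size=(n, dim)),                   # "layer 0": no label information
    class_means[y] + rng.normal(size=(n, dim)),  # "layer 1": label information
]
train, test = slice(0, 400), slice(400, n)
accuracies = layerwise_probe(
    [layer[train] for layer in layers], y[train],
    [layer[test] for layer in layers], y[test],
    n_classes,
)
print(accuracies)
```

Plotting such accuracies against layer index, once per pretraining language and test language, gives the kind of layerwise pattern that the findings compare.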

Abstract

Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
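One common way to quantify whether two subspaces are orthogonal is to compute the principal angles between them. The sketch below illustrates that idea under stated assumptions and should not be read as the paper's geometric analysis: each subspace is estimated from centered class means, the helper names are hypothetical, and the data are synthetic, with "phone" information placed in dimensions 0 and 1 and "speaker" information in dimensions 2 and 3.

```python
import numpy as np

def class_mean_subspace(X, labels, k):
    """Orthonormal basis (dim x k) for the top-k directions separating class means."""
    classes = np.unique(labels)
    means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    centered = means - means.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T

def principal_cosines(A, B):
    """Cosines of the principal angles between the column spaces of A and B.

    A and B must have orthonormal columns. Values near 0 indicate
    near-orthogonal subspaces; values near 1 indicate shared directions.
    """
    return np.linalg.svd(A.T @ B, compute_uv=False)

# Synthetic features with balanced, crossed labels (assumption: illustrative only).
n, dim = 900, 8
idx = np.arange(n)
phone = idx % 3
speaker = (idx // 3) % 3

rng = np.random.default_rng(0)
X = 0.1 * rng.normal(size=(n, dim))
phone_means = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
speaker_means = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
X[:, 0:2] += phone_means[phone]        # phone information in dims 0-1
X[:, 2:4] += speaker_means[speaker]    # speaker information in dims 2-3

phone_basis = class_mean_subspace(X, phone, k=2)
speaker_basis = class_mean_subspace(X, speaker, k=2)
cosines = principal_cosines(phone_basis, speaker_basis)
print(cosines)
```

Because the two label types occupy disjoint dimensions in this toy data, the cosines come out close to zero. On real model features the same computation would show how much the estimated phone, tone, and speaker subspaces overlap.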

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Language Development and Disorders