Do self-supervised speech models develop human-like perception biases?

Juliette Millet; Ewan Dunbar

arXiv:2205.15819 · cs.CL · June 1, 2022

Open Access

TL;DR

This study asks whether self-supervised speech models develop perception biases like those of human listeners. It finds that wav2vec 2.0 and HuBERT form universal speech perception spaces, while the CPC model shows a small native language effect. The question matters for low-resource languages, where such models could reduce the need for manual annotation.

Contribution

The paper compares the representational spaces of three self-supervised speech models (wav2vec 2.0, HuBERT and CPC) with the perceptual spaces of French-speaking and English-speaking listeners, separating language-specific from universal representations.

Findings

01. CPC model shows a small native language effect.

02. wav2vec 2.0 and HuBERT develop universal speech perception spaces.

03. Self-supervised models capture fine-grained perceptual phenomena.

Abstract

Self-supervised models for speech processing form representational spaces without using any external labels. Increasingly, they appear to be a feasible way of at least partially eliminating costly manual annotations, a problem of particular concern for low-resource languages. But what kind of representational spaces do these models construct? Human perception specializes to the sounds of listeners' native languages. Does the same thing happen in self-supervised models? We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive predictive coding (CPC), and compare them with the perceptual spaces of French-speaking and English-speaking human listeners, both globally and taking account of the behavioural differences between the two language groups. We show that the CPC model shows a small native language effect, but…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing

Methods: InfoNCE · Contrastive Predictive Coding