Neural Representations for Modeling Variation in Speech

Martijn Bartelds; Wietse de Vries; Faraz Sanal; Caitlin Richter; Mark Liberman; Martijn Wieling

arXiv:2011.12649·cs.CL·January 27, 2022

TL;DR

This paper uses neural speech representations to quantify pronunciation variation. It shows that Transformer-based acoustic embeddings match human perception better than approaches based on phonetic transcriptions or MFCC features, and that they capture segmental, intonational, and durational differences.

Contribution

It demonstrates the effectiveness of neural model-derived acoustic embeddings, especially from transformers, in modeling speech variation and matching human perceptual judgments.

Findings

01 Transformer-based speech representations outperform previous methods.

02 Middle hidden layers provide the most informative features.

03 Neural embeddings capture segmental, intonational, and durational differences.
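
One way to read the second finding is as a layer-selection step: compute pronunciation distances from each hidden layer, then keep the layer whose distances track human judgements most closely. The sketch below illustrates that idea with Pearson correlation. The function names `pearson_r` and `best_layer`, the use of Pearson correlation, and the toy numbers are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def best_layer(distances_per_layer, human_ratings):
    """Return (layer index, r) for the layer whose distances correlate
    most strongly (in absolute value) with the human ratings.

    The absolute value is used because the sign depends on whether the
    ratings measure similarity or difference (an assumption here).
    """
    scores = [pearson_r(d, human_ratings) for d in distances_per_layer]
    best = int(np.argmax(np.abs(scores)))
    return best, scores[best]

# Synthetic toy data, NOT results from the paper:
# five speakers, three hypothetical layers.
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
layers = [
    np.array([0.3, 0.1, 0.4, 0.2, 0.5]),  # moderately related
    np.array([0.1, 0.2, 0.3, 0.4, 0.5]),  # perfectly linear in the ratings
    np.array([0.5, 0.1, 0.3, 0.2, 0.4]),  # nearly unrelated
]
layer, r = best_layer(layers, human)
print(f"best layer: {layer}, r = {r:.2f}")
```

With real data, each entry of `distances_per_layer` would hold one averaged distance per speaker, computed from the embeddings of a single hidden layer.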

Abstract

Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error-prone. As an alternative, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity. We show that speech representations extracted from a specific type of neural model (i.e. Transformers) lead to a better match with human perception than two earlier approaches on the basis of phonetic transcriptions and MFCC-based acoustic features. We…
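
To compare two pronunciations of the same word, their frame-level embedding sequences must be aligned, because the recordings usually differ in length. The abstract as shown here does not name the alignment method, so the sketch below assumes dynamic time warping (DTW) with Euclidean frame distances and normalization by the summed sequence lengths. The function name `dtw_distance` and the random arrays standing in for model embeddings are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW cost between two (frames, dims) arrays.

    Assumptions: Euclidean distance between frames, and normalization
    by the summed sequence lengths.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # acc[i, j] = cheapest alignment cost of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    return acc[n, m] / (n + m)

# Random stand-ins for embeddings of one word (20 frames, 8 dims).
rng = np.random.default_rng(0)
native = rng.normal(size=(20, 8))
slower = np.repeat(native, 2, axis=0)  # same content, half the speed
shifted = native + 0.5                 # same timing, different content

print(dtw_distance(native, slower))   # tempo change alone adds no cost
print(dtw_distance(native, shifted))  # content change gives a positive cost
```

In this setup a pure change in speaking rate costs nothing, while a change in the frame content does. Averaging such word-level distances over all words of a speaker would give one speaker-level score that can be compared with human judgements.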

Code & Models

Repositories

Bartelds/neural-acoustic-distance (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dense Connections · Multi-Head Attention · Residual Connection · Dropout · Layer Normalization · Softmax