Neural Representations for Modeling Variation in Speech
Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, Martijn Wieling

TL;DR
This paper explores neural speech representations to quantify pronunciation variation, showing that transformer-based acoustic embeddings align better with human perception than traditional methods and capture nuanced speech features.
Contribution
It demonstrates the effectiveness of neural model-derived acoustic embeddings, especially from transformers, in modeling speech variation and matching human perceptual judgments.
Findings
Transformer-based speech representations outperform previous methods.
Middle hidden layers provide the most informative features.
Neural embeddings capture segmental, intonational, and durational differences.
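One way to read the layer finding is as a model-selection step: compute pronunciation distances from each hidden layer and keep the layer whose distances correlate best with human ratings. The sketch below illustrates that idea only. The layer indices, distance values, and ratings are made-up placeholders rather than results from the paper, and it assumes ratings are scaled so that higher means "more different" (with similarity ratings the sign of the correlation flips).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def best_layer(distances_by_layer, human_ratings):
    """Return the layer whose distances correlate most strongly with the ratings."""
    scores = {layer: pearson_r(d, human_ratings)
              for layer, d in distances_by_layer.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical data: one rating per speaker, one distance per speaker and layer.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
distances_by_layer = {
    1: [0.3, 0.1, 0.4, 0.2, 0.5],
    6: [0.1, 0.2, 0.3, 0.4, 0.5],
    12: [0.5, 0.4, 0.3, 0.2, 0.1],
}

layer, scores = best_layer(distances_by_layer, human)
print(layer, scores)
```

In practice the layer would be chosen on held-out data so that the reported correlation is not inflated by the selection itself.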
Abstract
Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error prone. As an alternative, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity. We show that speech representations extracted from a specific type of neural model (i.e. Transformers) lead to a better match with human perception than two earlier approaches on the basis of phonetic transcriptions and MFCC-based acoustic features. We…
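To make the idea of a word-based pronunciation difference concrete, here is a minimal sketch that compares two sequences of frame-level embeddings with dynamic time warping (DTW). The abstract as shown here does not state the alignment method, so DTW, the Euclidean frame distance, and the length normalisation are assumptions, and the random arrays stand in for embeddings that would really come from a self-supervised speech model.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalised DTW cost between two (frames, dims) embedding sequences."""
    n, m = len(a), len(b)
    # Euclidean distance between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # acc[i, j] holds the cheapest alignment cost of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    # Normalise by total length (an assumed convention) so long words are not penalised.
    return float(acc[n, m] / (n + m))

# Placeholder embeddings for one word spoken by two hypothetical speakers.
rng = np.random.default_rng(0)
native = rng.normal(size=(20, 8))
non_native = rng.normal(size=(25, 8))

d_same = dtw_distance(native, native)
d_diff = dtw_distance(native, non_native)
print(d_same, d_diff)
```

Averaging such word-level distances over all words of a speaker would give a single per-speaker score that can then be compared with human judgements.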
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dense Connections · Multi-Head Attention · Residual Connection · Dropout · Layer Normalization · Softmax
