What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model

Mu Yang; Ram C. M. C. Shekar; Okim Kang; John H. L. Hansen

arXiv:2306.06524·eess.AS·June 13, 2023·1 cites

TL;DR

This paper investigates how fine-tuning a Wav2vec2-based model for accent identification enhances phoneme and prosody representations, revealing layer-specific encoding changes and accent-specific features through systematic probing analyses.

Contribution

The paper provides a detailed layer-wise analysis of self-supervised learning (SSL) model representations, showing how accent identification fine-tuning enriches phonetic and prosodic encoding in specific layers.

Findings

01

The top 2 Transformer layers learn richer phoneme and prosody representations after accent identification fine-tuning.

02

Strong accent-specific phoneme representations are found in layer 9.

03

The effects of accent identification fine-tuning share some similarities with those of fine-tuning on an automatic speech recognition task.

Abstract

This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the Self-Supervised Learning (SSL) model, brought by an accent identification (AID) fine-tuning task. This problem is addressed based on model probing. Specifically, we conduct a systematic layer-wise analysis of the representations of the Transformer layers on a phoneme correlation task, and a novel word-level prosody prediction task. We compare the probing performance of the pre-trained and fine-tuned SSL models. Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representation. These changes share some similarities with the effects of fine-tuning with an Automatic Speech Recognition task. In addition, we observe strong accent-specific phoneme representations in layer 9. To sum up, this study provides insights into the…
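To make the idea of layer-wise probing concrete, here is a minimal sketch, not the authors' implementation. It swaps in a generic linear classification probe (closed-form ridge regression onto one-hot labels) rather than the paper's phoneme correlation or word-level prosody tasks, and it uses synthetic features in place of real Wav2vec2 hidden states. The function names, the synthetic data generator, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_linear_probe(X_train, y_train, n_classes, l2=1e-2):
    """Fit a ridge-regression probe onto one-hot labels (closed form)."""
    # Append a constant column so the probe has a bias term.
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    Y = np.eye(n_classes)[y_train]
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W, X, y):
    """Classification accuracy of a fitted probe on held-out data."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))

def layerwise_probe(layer_feats, labels, n_classes, train_frac=0.8, seed=0):
    """Train one probe per layer and return the held-out accuracy of each.

    layer_feats: list of (n_samples, dim) arrays, one per layer.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train, test = idx[:n_train], idx[n_train:]
    scores = []
    for X in layer_feats:
        W = fit_linear_probe(X[train], labels[train], n_classes)
        scores.append(probe_accuracy(W, X[test], labels[test]))
    return scores

def synthetic_layers(labels, n_classes, signal_per_layer, dim=16, seed=0):
    """Hypothetical stand-in for model hidden states.

    Each layer is class centroid * signal strength + unit Gaussian noise,
    so a larger signal means the label is easier to decode linearly.
    """
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(n_classes, dim))
    return [s * centroids[labels] + rng.normal(size=(len(labels), dim))
            for s in signal_per_layer]

# Toy comparison: three "layers" with increasing label information.
rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=400)
layers = synthetic_layers(labels, 4, signal_per_layer=[0.0, 1.0, 3.0])
for i, acc in enumerate(layerwise_probe(layers, labels, 4)):
    print(f"layer {i}: probe accuracy = {acc:.2f}")
```

Running the same probe on features extracted from a pre-trained and a fine-tuned model, layer by layer, would give the kind of per-layer comparison the abstract describes. With a real model, the synthetic arrays would be replaced by the hidden states of each Transformer layer.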

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Softmax · Byte Pair Encoding · Residual Connection · Adam · Dropout