What Do Self-Supervised Speech Models Know About Words?
Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

TL;DR
This study investigates what word-level linguistic information self-supervised speech models (S3Ms) encode, and shows how the pre-training objective and model size affect the encoding of word identity, word boundaries, and semantics.
Contribution
The paper provides a comparative analysis of layer-wise representations in ten S3Ms, highlighting the influence of training objectives and model size on linguistic knowledge encoding.
Findings
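The informativeness of representations varies across layers, and the frame-level representations within a word segment are not all equally informative.
The pre-training objective and model size significantly affect how linguistic information is distributed across layers.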
S3Ms with visual grounding outperform speech-only models on certain tasks.
Abstract
Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model…
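To make the idea of a segment-level, layer-wise representation concrete, here is a minimal sketch of how frame-level features might be pooled over one word segment. It is an illustration under stated assumptions, not the authors' analysis method: the 20 ms frame shift, the two pooling modes, and the names `segment_frame_indices`, `pool_segment`, and `pool_all_layers` are all hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def segment_frame_indices(start_s: float, end_s: float,
                          frame_shift_s: float = 0.02) -> range:
    """Map a word's [start, end) time span to frame indices.

    The 20 ms frame shift is an assumption, not a value from the paper.
    """
    first = int(round(start_s / frame_shift_s))
    last = int(round(end_s / frame_shift_s))
    # Always keep at least one frame, even for very short words.
    return range(first, max(last, first + 1))


def pool_segment(frames: Sequence[Frame], start_s: float, end_s: float,
                 mode: str = "mean", frame_shift_s: float = 0.02) -> List[float]:
    """Reduce the frames inside one word segment to a single vector."""
    idx = [i for i in segment_frame_indices(start_s, end_s, frame_shift_s)
           if i < len(frames)]
    if not idx:
        raise ValueError("word segment lies outside the utterance")
    if mode == "mean":
        # Average every feature dimension over the frames in the segment.
        dim = len(frames[idx[0]])
        return [sum(frames[i][d] for i in idx) / len(idx) for d in range(dim)]
    if mode == "center":
        # Keep only the middle frame of the segment.
        return list(frames[idx[len(idx) // 2]])
    raise ValueError(f"unknown pooling mode: {mode}")


def pool_all_layers(layers: Sequence[Sequence[Frame]], start_s: float,
                    end_s: float, mode: str = "mean") -> List[List[float]]:
    """Apply the same pooling to each layer's frame sequence."""
    return [pool_segment(frames, start_s, end_s, mode) for frames in layers]


# Toy example: one layer, four 2-dimensional frames, word spans 0.02 s to 0.08 s.
layer = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [8.0, 6.0]]
mean_vec = pool_segment(layer, 0.02, 0.08, mode="mean")
center_vec = pool_segment(layer, 0.02, 0.08, mode="center")
```

Comparing pooled vectors such as `mean_vec` and `center_vec` on a downstream probe is one simple way to check whether some frames within a word segment carry more information than others.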
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Dialogue Systems
