Self-Supervised Speech Representations are More Phonetic than Semantic
Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe

TL;DR
This paper shows that self-supervised speech models encode phonetic information more strongly than semantic content, and questions whether common intent classification datasets adequately measure semantic abilities.
Contribution
The study introduces a novel dataset of near-homophone and synonym word pairs for analyzing word-level properties in self-supervised speech models (S3Ms), and provides evidence that their representations are more phonetic than semantic.
Findings
S3M representations are more similar for phonetically similar words.
S3M representations consistently show less semantic similarity than phonetic similarity.
Common intent classification datasets may not effectively measure semantic understanding: a simple baseline using only word identity surpasses S3M-based models.
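The comparison behind the first two findings can be illustrated with a minimal sketch. It assumes mean pooling of frame-level features into a single word vector and cosine similarity as the measure; both are illustrative choices here, not necessarily the pooling or similarity measures used in the paper. The function names and toy feature values are hypothetical and are not real model outputs.

```python
import math


def mean_pool(frames):
    """Average a list of frame-level feature vectors into one word-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarity(frames_a, frames_b):
    """Similarity between two words, each given as a list of frame features."""
    return cosine_similarity(mean_pool(frames_a), mean_pool(frames_b))


# Toy 2-dimensional frame features, invented for illustration only.
# In practice these would be S3M layer outputs for each spoken word segment.
toy_frames = {
    "word": [[1.0, 0.0], [1.0, 0.2]],
    "near_homophone": [[0.9, 0.1], [1.0, 0.1]],
    "synonym": [[0.0, 1.0], [0.2, 1.0]],
}

phonetic_sim = pair_similarity(toy_frames["word"], toy_frames["near_homophone"])
semantic_sim = pair_similarity(toy_frames["word"], toy_frames["synonym"])

print(f"near-homophone pair similarity: {phonetic_sim:.3f}")
print(f"synonym pair similarity:        {semantic_sim:.3f}")
```

With real representations, the same comparison would be repeated over many word pairs, and the two similarity distributions compared rather than a single pair of numbers.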
Abstract
Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques
