Self-Supervised Speech Representations are More Phonetic than Semantic

Kwanghee Choi; Ankita Pasad; Tomohiko Nakamura; Satoru Fukayama; Karen Livescu; Shinji Watanabe

arXiv:2406.08619·cs.CL·June 14, 2024

TL;DR

This paper shows that self-supervised speech models (S3Ms) encode phonetic information more strongly than semantic content, and questions whether common intent classification datasets adequately measure semantic abilities.

Contribution

The study introduces a novel dataset of near-homophone and synonym word pairs for analyzing word-level properties in S3Ms, and provides evidence that S3M representations are more phonetic than semantic.

Findings

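01. S3M representations are more similar for phonetically similar (near-homophone) word pairs.

02. Semantic similarity between S3M representations of synonym pairs is consistently lower than phonetic similarity.

03. Widely used intent classification datasets, such as Fluent Speech Commands and Snips Smartlights, may not adequately measure semantic abilities.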

Abstract

Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores…
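
The pair-similarity comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: mean pooling over frames, cosine similarity, the helper names, the example word pairs, and the toy vectors are all assumptions made for illustration, and the numbers are invented rather than taken from the paper. In practice the frame vectors would come from an S3M layer.

```python
import math

def mean_pool(frames):
    """Average equal-length frame vectors into one word-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pair_similarity(pairs, word_vectors):
    """Average cosine similarity over a list of (word_a, word_b) pairs."""
    sims = [cosine(word_vectors[a], word_vectors[b]) for a, b in pairs]
    return sum(sims) / len(sims)

# Invented toy frame vectors standing in for S3M frame features (hypothetical data).
toy_frames = {
    "right":   [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]],
    "ride":    [[0.9, 0.1, 0.1], [0.9, 0.1, 0.1]],
    "correct": [[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]],
}
word_vectors = {word: mean_pool(frames) for word, frames in toy_frames.items()}

# Hypothetical example pairs: one near-homophone pair and one synonym pair.
near_homophone_pairs = [("right", "ride")]
synonym_pairs = [("right", "correct")]

phonetic_sim = mean_pair_similarity(near_homophone_pairs, word_vectors)
semantic_sim = mean_pair_similarity(synonym_pairs, word_vectors)
print(f"near-homophone similarity: {phonetic_sim:.3f}")
print(f"synonym similarity:        {semantic_sim:.3f}")
```

Comparing the two averages over many pairs is the kind of evidence the abstract summarizes: if near-homophone pairs score higher than synonym pairs, the representations are closer phonetically than semantically.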

Code & Models

Repositories

juice500ml/phonetic_semantic_probing (Official)

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques