Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater

TL;DR
This paper investigates two geometric properties of self-supervised speech representations, the orthogonality of the speaker and phone subspaces and the isotropy of the representation space, and asks whether they correlate with linear probing accuracy on phone and speaker identification.
Contribution
The paper introduces Cumulative Residual Variance (CRV), a single measure that can assess both orthogonality and isotropy, and relates these properties to linear probing accuracy across six self-supervised models and two untrained baselines.
Findings
Both orthogonality and isotropy correlate with phonetic probing accuracy.
Isotropy results are more nuanced and context-dependent.
The proposed CRV measure can be used to assess both orthogonality and isotropy of the representations.
Abstract
Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures…
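The two geometric properties named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's CRV definition, which this summary does not give. It is a simplified stand-in under stated assumptions: orthogonality is approximated by the fraction of phone-centroid variance that survives after the principal directions of the speaker centroids are removed one by one, and isotropy is approximated by a participation ratio of the covariance eigenvalues. All function names and the toy data are hypothetical.

```python
import numpy as np

def class_centroids(X, labels):
    """Mean vector of each class, stacked as rows."""
    return np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])

def principal_directions(M):
    """Principal directions (rows) and variances of the rows of M,
    sorted by decreasing variance; numerically zero directions are dropped."""
    centered = M - M.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    keep = s > 1e-10 * s.max()
    return vt[keep], s[keep] ** 2 / max(len(M) - 1, 1)

def residual_variance_curve(A, B):
    """Fraction of the variance of A's rows that remains after removing the
    top-k principal directions of B, for k = 0 .. rank(B).
    Assumption: a simplified stand-in, not the paper's CRV."""
    dirs_b, _ = principal_directions(B)
    resid = A - A.mean(axis=0)
    total = (resid ** 2).sum()
    curve = [1.0]
    for d in dirs_b:
        # Project out one more direction of B's subspace.
        resid = resid - np.outer(resid @ d, d)
        curve.append((resid ** 2).sum() / total)
    return np.array(curve)

def effective_dimension_ratio(X):
    """Participation ratio of the covariance eigenvalues divided by the
    dimensionality; 1.0 means variance is spread evenly over all dimensions.
    Assumption: a generic isotropy proxy, not the paper's measure."""
    _, var = principal_directions(X)
    return (var.sum() ** 2 / (var ** 2).sum()) / X.shape[1]

def make_frames(phone_dims, dim=6):
    """Hypothetical toy data: one frame per (speaker, phone) pair.
    Speakers vary in dimensions 0-1, phones in `phone_dims`."""
    spk = np.zeros((4, dim))
    spk[:, [0, 1]] = [[1, 0], [-1, 0], [0, 2], [0, -2]]
    ph = np.zeros((5, dim))
    ph[:, phone_dims] = [[1, 1], [-1, 1], [1, -1], [-1, -1], [0, 0]]
    X = (spk[:, None, :] + ph[None, :, :]).reshape(-1, dim)
    spk_id = np.repeat(np.arange(4), 5)
    ph_id = np.tile(np.arange(5), 4)
    return X, spk_id, ph_id

# Case 1: speaker and phone information live in disjoint dimensions.
X_orth, spk_id, ph_id = make_frames([2, 3])
curve_orth = residual_variance_curve(
    class_centroids(X_orth, ph_id), class_centroids(X_orth, spk_id))

# Case 2: phone information shares dimension 1 with speaker information.
X_mixed, spk_id, ph_id = make_frames([1, 2])
curve_mixed = residual_variance_curve(
    class_centroids(X_mixed, ph_id), class_centroids(X_mixed, spk_id))

iso_orth = effective_dimension_ratio(X_orth)
print(curve_orth, curve_mixed, iso_orth)
```

In the first toy case the curve stays flat, because removing speaker directions leaves the phone centroids untouched; in the second it drops, because part of the phone variance lies in the speaker subspace. On real model features, `X` would hold frame-level representations and the labels would come from speaker metadata and phone alignments.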
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
