Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features
Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

TL;DR
This study investigates how self-supervised learning (SSL) speech models encode speaker characteristics within individual principal dimensions of their features, revealing that specific dimensions correspond to attributes like pitch, gender, intensity, and noise level.
Contribution
The paper demonstrates that individual principal components of utterance-averaged SSL speech features encode distinct speaker attributes, and that most of these attributes can be manipulated independently of one another.
Findings
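Across a range of SSL models, the principal dimension that explains the most variance encodes pitch and associated characteristics such as gender.
Other individual principal dimensions correlate with intensity, noise level, the second formant, and higher-frequency characteristics.
The dimensions for most characteristics are isolated from each other's influence, enabling targeted manipulation.
SSL features therefore contain interpretable, largely disentangled speaker information.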
Abstract
How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by…
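The analysis described in the abstract (PCA on utterance-averaged representations, followed by changing a single principal dimension) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`utterance_average`, `fit_pca`, `project`, `shift_dimension`) are hypothetical, the features are assumed to be arrays of shape (frames, dimensions) taken from some SSL model, and the synthesis step used in the paper is not included.

```python
import numpy as np


def utterance_average(frames):
    """Average frame-level features of shape (T, D) over time, giving one (D,) vector."""
    return np.asarray(frames, dtype=float).mean(axis=0)


def fit_pca(X):
    """Fit PCA via SVD on utterance-level features X of shape (N, D).

    Returns the feature mean, the principal directions (one per row, sorted
    by explained variance), and the explained-variance ratio per direction.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, singular_values, components = np.linalg.svd(X - mean, full_matrices=False)
    variance = singular_values ** 2 / (X.shape[0] - 1)
    return mean, components, variance / variance.sum()


def project(X, mean, components):
    """Express features of shape (N, D) as scores along each principal dimension."""
    return (np.asarray(X, dtype=float) - mean) @ components.T


def shift_dimension(x, components, k, delta):
    """Move one feature vector by delta along principal dimension k.

    Because the principal directions are orthonormal, the scores along all
    other principal dimensions are left unchanged.
    """
    return np.asarray(x, dtype=float) + delta * components[k]
```

In use, one would stack the output of `utterance_average` for every utterance into a matrix, call `fit_pca`, and then correlate each column of `project(...)` with a measured attribute such as mean pitch to see which dimension tracks it. Whether a shifted feature vector actually changes the corresponding characteristic in resynthesized speech is an empirical question that this sketch does not address.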
