Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang

TL;DR
This paper investigates how self-supervised speech Transformers encode speaker information. It identifies feed-forward neurons correlated with speaker traits and demonstrates their importance by showing that protecting them during pruning preserves performance on speaker-related tasks.
Contribution
The study identifies neurons in the feed-forward layers that are correlated with speaker information and shows that protecting these neurons during pruning preserves performance on speaker-related tasks.
Findings
Neurons correlated with speaker traits can be identified via clustering.
Protecting speaker-related neurons preserves speaker task performance.
The k-means clusters of self-supervised features and i-vectors correspond to broad phonetic and gender classes.
Abstract
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
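The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes that a neuron is "associated" with a cluster when its mean activation over that cluster's frames is among the highest, and that pruning removes the neurons with the lowest importance score. The function names `cluster_neurons` and `prune_mask`, the `top_k` selection rule, and the `importance` scores are all hypothetical choices made for illustration.

```python
import numpy as np

def cluster_neurons(acts, labels, top_k):
    """Map each cluster label to the top_k neurons with the highest
    mean activation over the frames assigned to that cluster.

    acts:   (n_frames, n_neurons) feed-forward activations (assumed given)
    labels: (n_frames,) cluster id per frame, e.g. from k-means
    """
    result = {}
    for c in np.unique(labels):
        mean_act = acts[labels == c].mean(axis=0)
        # argsort is ascending, so reverse it to get the largest first
        result[int(c)] = np.argsort(mean_act)[::-1][:top_k]
    return result

def prune_mask(importance, protected, n_prune):
    """Boolean keep-mask that drops the n_prune least important neurons
    while never dropping a protected neuron."""
    idx = np.array(sorted(protected), dtype=int)
    score = importance.astype(float).copy()
    score[idx] = np.inf            # protected neurons sort last
    keep = np.ones(score.shape[0], dtype=bool)
    keep[np.argsort(score)[:n_prune]] = False
    keep[idx] = True               # guard when n_prune exceeds the unprotected count
    return keep

# Toy data: neuron 0 fires on cluster 0, neuron 3 fires on cluster 1.
acts = np.array([[5, 0, 1, 0], [6, 0, 1, 0], [4, 0, 1, 0],
                 [0, 0, 1, 7], [0, 0, 1, 8], [0, 0, 1, 9]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

per_cluster = cluster_neurons(acts, labels, top_k=1)
protected = {int(i) for ids in per_cluster.values() for i in ids}

# Hypothetical importance scores (for example, weight magnitudes).
importance = np.array([0.1, 0.5, 0.9, 0.2])
keep = prune_mask(importance, protected, n_prune=2)
```

In this toy case, neurons 0 and 3 have the lowest importance scores and would be pruned first without protection. With protection they are kept, and the two unprotected neurons are removed instead.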
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems
