Sparse Autoencoder Insights on Voice Embeddings
Daniel Pluth, Yu Zhou, Vijay K. Gurbani

TL;DR
This paper demonstrates that sparse autoencoders can effectively extract interpretable, mono-semantic features from speaker embeddings in audio data, extending explainability techniques beyond text-based models.
Contribution
It applies sparse autoencoders to speaker embeddings, a non-textual domain, and shows that the technique can make audio embeddings interpretable.
Findings
Sparse autoencoders extract features such as language and music from speaker embeddings.
The extracted features behave like those found in LLM embeddings, including feature splitting and steering.
The autoencoder can be used to manipulate (steer) specific features in the embedded data.
Abstract
Recent advances in explainable machine learning have highlighted the potential of sparse autoencoders in uncovering mono-semantic features in densely encoded embeddings. While most research has focused on Large Language Model (LLM) embeddings, the applicability of this technique to other domains remains largely unexplored. This study applies sparse autoencoders to speaker embeddings generated from a Titanet model, demonstrating the effectiveness of this technique in extracting mono-semantic features from non-textual embedded data. The results show that the extracted features exhibit characteristics similar to those found in LLM embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding. The findings suggest that sparse autoencoders can be…
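To make the abstract's terms concrete, the following is a minimal sketch of a sparse autoencoder forward pass and of feature steering. It is not the authors' implementation: the class name `SparseAutoencoder`, the helper `steer`, the toy dimensions, the random untrained weights, and the L1 coefficient are all assumptions chosen for illustration. A real setup would train the weights on embeddings from the speaker model and would likely use a tensor library.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Toy overcomplete autoencoder (hypothetical; weights are random, not trained)."""

    def __init__(self, d_embed, d_hidden, seed=0):
        rng = random.Random(seed)
        # Encoder maps the dense embedding to a wider feature vector.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_embed)] for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        # Decoder maps feature activations back to embedding space.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_embed)]
        self.b_dec = [0.0] * d_embed

    def encode(self, x):
        pre = matvec(self.W_enc, x)
        # ReLU keeps activations non-negative; training with an L1 penalty makes them sparse.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Reconstruction error plus an L1 sparsity penalty on the features."""
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(f)  # f >= 0, so sum(f) is its L1 norm


def steer(sae, x, feature_idx, value):
    """Overwrite one feature activation, then decode back to an embedding."""
    f = sae.encode(x)
    f[feature_idx] = value
    return sae.decode(f)


# Toy usage: an 8-dimensional stand-in for a speaker embedding, 32 features.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
sae = SparseAutoencoder(d_embed=8, d_hidden=32)
features = sae.encode(x)
steered = steer(sae, x, feature_idx=0, value=5.0)
```

In a trained model, individual entries of `features` would ideally correspond to single concepts such as a language or the presence of music, and `steer` illustrates how clamping one such entry changes the reconstructed embedding.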
Taxonomy
Topics: Speech Recognition and Synthesis
