Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability
Xiaoxu Zhu, Junhua Li, Aaron J. Li, Guangchao Yao, Xiaojie Yu

TL;DR
This paper introduces a benchmark and a filtering method to measure and reduce residual speaker information in speech model embeddings, improving privacy without retraining.
Contribution
It presents InterpTRQE-SptME, a benchmark for quantifying residual speaker information, and InterpTF-SptME, a filtering technique that removes speaker information from embeddings.
Findings
SHAP Noise filtering reduces speaker residuals from 18.05% to nearly zero.
The filtering maintains recognition accuracy, with a less than 1% increase in CTC loss.
The method is model-agnostic and requires no retraining.
Abstract
Self-supervised speech models learn representations that capture both content and speaker information. Yet this entanglement creates problems: content tasks suffer from speaker bias, and privacy concerns arise when speaker identity leaks through supposedly anonymized representations. We present two contributions to address these challenges. First, we develop InterpTRQE-SptME (Timbre Residual Quantitative Evaluation Benchmark of Speech pre-training Models Encoding via Interpretability), a benchmark that directly measures residual speaker information in content embeddings using SHAP-based interpretability analysis. Unlike existing indirect metrics, our approach quantifies the exact proportion of speaker information remaining after disentanglement. Second, we propose InterpTF-SptME, which uses these interpretability insights to filter speaker information from embeddings. Testing on VCTK…
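The idea of attributing speaker information to embedding dimensions and then filtering those dimensions can be illustrated with a toy sketch. This is not the paper's InterpTRQE-SptME or InterpTF-SptME procedure, whose details are not given in this summary. It assumes synthetic embeddings, a hypothetical linear speaker probe, and the closed-form SHAP values that hold for a linear model with independent features, phi_d = w_d * (x_d - E[x_d]). The function names, the choice of k, and the Gaussian replacement noise are all illustrative assumptions.

```python
import numpy as np

def linear_shap(weights, X, background):
    """SHAP values of a linear model under feature independence:
    phi[n, d] = w[d] * (X[n, d] - mean of background[:, d])."""
    return weights * (X - background.mean(axis=0))

def attribution_share(phi, dims):
    """Fraction of total absolute attribution carried by `dims` (hypothetical metric)."""
    total = np.abs(phi).sum()
    return float(np.abs(phi[:, dims]).sum() / total) if total > 0 else 0.0

def noise_filter(X, phi, k, rng):
    """Replace the k dimensions with the largest mean |SHAP| by Gaussian noise
    matched to each dimension's mean and standard deviation."""
    importance = np.abs(phi).mean(axis=0)
    top = np.argsort(importance)[-k:]
    X_filtered = X.copy()
    mu, sigma = X[:, top].mean(axis=0), X[:, top].std(axis=0)
    X_filtered[:, top] = rng.normal(mu, sigma, size=(X.shape[0], k))
    return X_filtered, top

def probe_accuracy(X, w, b, y):
    """Accuracy of the linear probe sign(Xw + b) against labels y in {-1, +1}."""
    return float(np.mean(np.sign(X @ w + b) == y))

# Synthetic "content embeddings": 16 dims, speaker identity leaks into dims 0..2.
rng = np.random.default_rng(0)
n, d = 400, 16
y = 2 * rng.integers(0, 2, size=n) - 1           # two toy speakers, labels -1 / +1
X = rng.normal(size=(n, d))
X[:, :3] += 3.0 * y[:, None]                     # assumed leakage pattern

# Hypothetical speaker probe: least-squares linear classifier with a bias term.
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
w, b = coef[:-1], coef[-1]

phi = linear_shap(w, X, X)                       # attribute probe output to dimensions
X_filtered, top = noise_filter(X, phi, k=3, rng=rng)

share_top = attribution_share(phi, top)
acc_before = probe_accuracy(X, w, b, y)
acc_after = probe_accuracy(X_filtered, w, b, y)
print("filtered dims:", sorted(top.tolist()))
print("attribution share of filtered dims:", round(share_top, 3))
print("probe accuracy before / after:", acc_before, acc_after)
```

In this toy setting the probe's accuracy after filtering serves as a rough stand-in for residual speaker information. A real evaluation would use an actual speech model's embeddings, a trained speaker classifier, and a check that content-task loss is preserved.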
