Speaker Disentanglement of Speech Pre-trained Model Based on Interpretability

Xiaoxu Zhu; Junhua Li; Aaron J. Li; Guangchao Yao; Xiaojie Yu

arXiv:2507.17851·cs.SD·April 2, 2026

TL;DR

This paper introduces a benchmark and a filtering method to measure and reduce residual speaker information in speech model embeddings, improving privacy without retraining.

Contribution

The paper presents InterpTRQE-SptME, a benchmark that quantifies residual speaker information in content embeddings, and InterpTF-SptME, a filtering technique that removes that speaker information from the embeddings.

Findings

01. SHAP Noise filtering reduces speaker residuals from 18.05% to nearly zero.

02. The filtering maintains recognition accuracy with less than 1% increase in CTC loss.

03. The method is model-agnostic and requires no retraining.

Abstract

Self-supervised speech models learn representations that capture both content and speaker information. Yet this entanglement creates problems: content tasks suffer from speaker bias, and privacy concerns arise when speaker identity leaks through supposedly anonymized representations. We present two contributions to address these challenges. First, we develop InterpTRQE-SptME (Timbre Residual Quantitative Evaluation Benchmark of Speech pre-training Models Encoding via Interpretability), a benchmark that directly measures residual speaker information in content embeddings using SHAP-based interpretability analysis. Unlike existing indirect metrics, our approach quantifies the exact proportion of speaker information remaining after disentanglement. Second, we propose InterpTF-SptME, which uses these interpretability insights to filter speaker information from embeddings. Testing on VCTK…
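
The measure-then-filter idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical linear speaker classifier over a concatenated [speaker, content] embedding, uses the closed-form SHAP values of a linear model with independent features (phi_i = w_i * (x_i - E[x_i])), and defines the "residual" as the share of absolute attribution that lands on the content dimensions. The function names, the synthetic data, the weights, and the choice to overwrite the top-attributed dimensions with the background mean plus small Gaussian noise are all assumptions made for illustration.

```python
import numpy as np


def linear_shap(weights, embeddings, background_mean):
    # Closed-form SHAP values of a linear model under feature independence:
    # phi_i = w_i * (x_i - E[x_i]).
    return (embeddings - background_mean) * weights


def residual_ratio(attributions, content_dims):
    # Share of total absolute attribution that falls on the content dimensions.
    total = np.abs(attributions).sum()
    if total == 0:
        return 0.0
    return float(np.abs(attributions[:, content_dims]).sum() / total)


def noise_filter(embeddings, attributions, content_dims, background_mean,
                 top_k, noise_scale, rng):
    # Rank content dimensions by mean |attribution| and overwrite the top_k
    # with the background mean plus small Gaussian noise (an assumed scheme).
    importance = np.abs(attributions[:, content_dims]).mean(axis=0)
    top_dims = content_dims[np.argsort(importance)[-top_k:]]
    filtered = embeddings.copy()
    noise = rng.normal(0.0, noise_scale, size=(embeddings.shape[0], top_k))
    filtered[:, top_dims] = background_mean[top_dims] + noise
    return filtered, top_dims


# Synthetic stand-in for real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
n_frames, n_speaker, n_content = 200, 4, 8
embeddings = rng.normal(size=(n_frames, n_speaker + n_content))
content_dims = np.arange(n_speaker, n_speaker + n_content)

# Hypothetical speaker-logit weights: two content dimensions leak identity.
weights = np.zeros(n_speaker + n_content)
weights[:n_speaker] = 1.0
weights[content_dims[:2]] = 0.5

background_mean = embeddings.mean(axis=0)
attr_before = linear_shap(weights, embeddings, background_mean)
residual_before = residual_ratio(attr_before, content_dims)

filtered, top_dims = noise_filter(
    embeddings, attr_before, content_dims, background_mean,
    top_k=2, noise_scale=0.01, rng=rng,
)
attr_after = linear_shap(weights, filtered, background_mean)
residual_after = residual_ratio(attr_after, content_dims)

print(f"residual before: {residual_before:.4f}, after: {residual_after:.4f}")
```

In this toy setup the filter touches only the two leaking content dimensions and leaves the others unchanged, which mirrors the claim that speaker information can be reduced without retraining the upstream model. A real evaluation would replace the linear classifier and synthetic data with a trained speaker model and actual embeddings, and would also track a content metric such as CTC loss.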
