TL;DR
This paper introduces a factor-partitioned embedding framework for speech that separates multiple attributes into distinct subspaces, enabling attribute-conditioned retrieval and suppression of biases.
Contribution
It proposes a novel multi-axis embedding method that disentangles speech attributes into subspaces, improving retrieval and bias control over conventional single-vector embeddings.
Findings
The embeddings support attribute-conditioned retrieval, including explicit suppression of one attribute to surface another.
Signed axis weighting reduces same-speaker bias in cross-corpus retrieval.
The implementation is publicly available on GitHub.
Abstract
Speech encodes multiple simultaneous attributes -- linguistic content, speaker identity, dialect, gender -- that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how -- or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can…
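The scoring rule described in the abstract, a signed weighted sum over per-axis cosine scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the axis names, the slice layout in `AXES`, the toy vectors, and the weight values are all hypothetical and chosen only to show how a negative speaker weight can change the ranking.

```python
import math

# Hypothetical layout: each axis occupies a contiguous slice of the full vector.
# The real subspace sizes and axis names are not given in the abstract.
AXES = {
    "content": slice(0, 4),
    "speaker": slice(4, 8),
}

def cosine(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def signed_similarity(query, candidate, weights):
    """Signed weighted sum of per-axis cosine scores."""
    return sum(
        w * cosine(query[AXES[name]], candidate[AXES[name]])
        for name, w in weights.items()
    )

def rank(query, gallery, weights):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {key: signed_similarity(query, vec, weights) for key, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (made up): the query and one candidate share a speaker subspace,
# while the other candidate matches the content subspace exactly.
QUERY = [1, 0, 0, 0, 1, 0, 0, 0]
GALLERY = {
    "same_text_other_speaker": [1, 0, 0, 0, 0, 1, 0, 0],
    "similar_text_same_speaker": [1, 1, 0, 0, 1, 0, 0, 0],
}

# Positive speaker weight: the same-speaker candidate is ranked first.
print(rank(QUERY, GALLERY, {"content": 1.0, "speaker": 1.0}))
# Negative speaker weight: speaker similarity is penalized and the exact content match wins.
print(rank(QUERY, GALLERY, {"content": 1.0, "speaker": -0.5}))
```

In this toy setup, setting a weight to zero ignores an axis, while a negative weight actively pushes down candidates that match the query on that axis, which is one way to read the abstract's description of suppressing one attribute to surface another.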
