An Attribute-Aligned Strategy for Learning Speech Representation
Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee

TL;DR
This paper introduces an attribute-aligned learning strategy that uses a layered-representation variational autoencoder (LR-VAE) to derive privacy-aware speech representations: an identity-free representation for speech emotion recognition (SER) and an emotionless representation for speaker verification (SV).
Contribution
The paper proposes a novel attribute-aligned learning strategy with LR-VAE to produce flexible, identity-free, and emotionless speech representations, reducing complexity for privacy-preserving tasks.
Findings
Achieves competitive performance on identity-free speech emotion recognition.
Improves speaker verification accuracy with emotionless representations.
Reduces model complexity and training effort for multiple privacy tasks.
Abstract
Advances in speech technology have brought convenience to our lives. However, concern is rising because speech signals contain multiple personal attributes, which can lead to either leakage of sensitive information or biased decisions. In this work, we propose an attribute-aligned learning strategy to derive speech representations that can flexibly address these issues through an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER) and an emotionless representation for speaker verification (SV). Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV compared to the current state-of-the-art method of using…
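The attribute-selection mechanism mentioned in the abstract can be pictured with a minimal sketch. It assumes the latent vector is already split into groups of attribute-sensitive nodes and that selection simply zeroes the groups to be withheld. The group names, index ranges, and the `select_attributes` helper are hypothetical illustrations, not the paper's actual layout or implementation.

```python
# Hypothetical layout: which latent nodes are assumed sensitive to which
# attribute. These index ranges are illustrative, not taken from the paper.
NODE_GROUPS = {
    "emotion": range(0, 4),
    "identity": range(4, 8),
    "residual": range(8, 12),
}


def select_attributes(latent, drop):
    """Return a copy of `latent` with the nodes of every attribute in `drop` zeroed."""
    masked = list(latent)
    for name in drop:
        for i in NODE_GROUPS[name]:
            masked[i] = 0.0
    return masked


# Stand-in for an encoder output; a real model would produce this vector.
latent = [float(i + 1) for i in range(12)]

# Identity-free representation, e.g. as input to an emotion recognizer.
identity_free = select_attributes(latent, drop=["identity"])

# Emotionless representation, e.g. as input to a speaker verifier.
emotionless = select_attributes(latent, drop=["emotion"])
```

Under this assumption, one learned representation serves both tasks, and only the mask changes between them.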
