Siamese x-vector reconstruction for domain adapted speaker recognition
Shai Rozenberg, Hagai Aronowitz, Ron Hoory

TL;DR
This paper proposes Siamese x-vector Reconstruction (SVR), a domain adaptation technique that reconstructs high-quality speaker embeddings from low-quality signals, significantly improving speaker recognition accuracy under mismatched recording conditions.
Contribution
The paper introduces SVR, a novel Siamese DNN-based method for reconstructing high-quality speaker embeddings from degraded signals for better domain adaptation.
Findings
SVR significantly improves recognition accuracy when recording conditions such as noise or sample rate are mismatched.
The method outperforms the baseline across several mismatch scenarios.
Reconstructing the embedding of a higher-quality signal from its lower-quality counterpart makes speaker recognition more robust to such mismatch.
Abstract
With the rise of voice-activated applications, the need for speaker recognition is rapidly increasing. The x-vector, an embedding approach based on a deep neural network (DNN), is considered the state of the art when proper end-to-end training is not feasible. However, accuracy decreases significantly when recording conditions (noise, sample rate, etc.) are mismatched, either between the x-vector training data and the target data or between enrollment and test data. We introduce Siamese x-vector Reconstruction (SVR) for domain adaptation. We reconstruct the embedding of a higher-quality signal from its lower-quality counterpart using a lean auxiliary Siamese DNN. We evaluate our method on several mismatch scenarios and demonstrate significant improvement over the baseline.
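The reconstruction idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the summary does not describe the SVR architecture, loss, or training data, so the code below swaps in a plain linear map trained by gradient descent on a mean-squared-error loss, and uses synthetic vectors in place of real x-vectors. The names `train_reconstructor`, `cosine_score`, and `channel`, the embedding dimension, and the noise model are all assumptions made for illustration.

```python
import numpy as np

def train_reconstructor(x_lo, x_hi, lr=0.2, epochs=300):
    """Fit W, b so that x_lo @ W + b approximates x_hi under an MSE loss.

    A linear stand-in for the auxiliary network (assumption, not the paper's DNN).
    """
    n, dim = x_lo.shape
    W = np.eye(dim)      # identity start: equivalent to no adaptation
    b = np.zeros(dim)
    losses = []
    for _ in range(epochs):
        err = x_lo @ W + b - x_hi               # residual, shape (n, dim)
        losses.append(float(np.mean(err ** 2)))
        scale = 2.0 / (n * dim)                 # from differentiating the mean of squares
        W -= lr * scale * (x_lo.T @ err)
        b -= lr * scale * err.sum(axis=0)
    return W, b, losses

def cosine_score(a, b):
    """Cosine similarity, a common back-end score for comparing embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-in data (hypothetical): "high-quality" embeddings and a
# degraded copy produced by a random linear distortion plus additive noise.
rng = np.random.default_rng(0)
dim, n = 8, 500
x_hi = rng.normal(size=(n, dim))
channel = np.eye(dim) + 0.3 * rng.normal(size=(dim, dim))
x_lo = x_hi @ channel + 0.1 * rng.normal(size=(n, dim))

W, b, losses = train_reconstructor(x_lo, x_hi)
x_rec = x_lo @ W + b   # reconstructed embeddings, to be scored against enrollment
```

At test time, under these assumptions, a low-quality embedding would be passed through the learned map and the result compared with the enrolled embedding, for example with `cosine_score(x_rec[0], x_hi[0])`.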
