U-vectors: Generating clusterable speaker embedding from unlabeled data
M. F. Mridha, Abu Quwsar Ohi, Muhammad Mostafa Monowar, Md. Abdul Hamid, Md. Rashedul Islam, Yutaka Watanobe

TL;DR
This paper presents an unsupervised method for generating clusterable speaker embeddings from unlabeled speech data, improving robustness across diverse domains without relying on domain adaptation.
Contribution
Introduces u-vectors, an unsupervised approach to produce speaker embeddings from unlabeled data, reducing dependence on domain-specific training and adaptation.
Findings
Achieves satisfactory speaker recognition performance on multiple datasets.
Demonstrates robustness across different languages and domain shifts.
Uses a pairwise architecture for effective unsupervised embedding generation.
Abstract
Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built in two stages: the first extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system depends mainly on the extraction of speech embeddings, and the embedding extractors are primarily pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models depends greatly on the domain adaptation policy, and it may degrade if adaptation is trained on inadequate data. This paper introduces a speaker recognition strategy for unlabeled data that generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy assumes that a small speech segment contains a single speaker.…
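The single-speaker assumption suggests one way that training pairs for a pairwise model could be drawn from unlabeled audio: two fixed-size frames cut from the same short segment are treated as coming from the same speaker. The sketch below illustrates only that idea and is not the authors' implementation. The function names `split_into_frames` and `sample_same_segment_pairs`, the frame length, and the number of pairs per segment are hypothetical choices made for this example; the abstract does not state how the paper actually forms pairs.

```python
import random


def split_into_frames(signal, frame_len):
    """Cut a 1-D sequence of samples into non-overlapping fixed-size frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def sample_same_segment_pairs(segments, frame_len, pairs_per_segment, seed=0):
    """Draw frame pairs assumed to share a speaker.

    Assumption (from the abstract): a small speech segment contains a single
    speaker, so any two frames from one segment form a same-speaker pair.
    Returns a list of (segment_index, frame_a, frame_b) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for seg_idx, segment in enumerate(segments):
        frames = split_into_frames(segment, frame_len)
        if len(frames) < 2:
            continue  # a pair needs at least two distinct frames
        for _ in range(pairs_per_segment):
            frame_a, frame_b = rng.sample(frames, 2)  # two different frames
            pairs.append((seg_idx, frame_a, frame_b))
    return pairs


# Toy "audio": integer lists stand in for raw samples (hypothetical data).
segments = [list(range(0, 10)), list(range(100, 107)), list(range(200, 203))]
pairs = sample_same_segment_pairs(segments, frame_len=4, pairs_per_segment=3)
```

With a frame length of 4, only the first toy segment yields two full frames, so the other two segments are skipped. In a real pipeline the frames would be audio features rather than integers, and the pairs would be fed to whatever pairwise network and loss the paper defines.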
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
