Language-Invariant Multilingual Speaker Verification for the TidyVoice 2026 Challenge
Ze Li, Xiaoxiao Miao, Juan Liu, Ming Li

TL;DR
This paper introduces a language-invariant multilingual speaker verification system using a self-supervised model, adversarial training, and speech synthesis to improve cross-lingual robustness and performance in the TidyVoice 2026 Challenge.
Contribution
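The paper proposes a multilingual SV system that learns language-invariant speaker embeddings by combining a fine-tuned self-supervised w2v-BERT 2.0 backbone, language-adversarial training with a Gradient Reversal Layer, and augmentation with speech synthesized by a multilingual zero-shot text-to-speech system.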
Findings
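Fine-tuning the large-scale pretrained w2v-BERT 2.0 model yields competitive speaker verification performance.
Language-adversarial training further enhances robustness.
Synthetic multilingual speech improves accuracy when real training data is limited.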
Abstract
Multilingual speaker verification (SV) remains challenging due to limited cross-lingual data and language-dependent information in speaker embeddings. This paper presents a language-invariant multilingual SV system for the TidyVoice 2026 Challenge. We adopt the multilingual self-supervised w2v-BERT 2.0 model as the backbone, enhanced with Layer Adapters and Multi-scale Feature Aggregation to better exploit multi-layer representations. A language-adversarial training strategy with a Gradient Reversal Layer is applied to promote language-invariant speaker embeddings. Moreover, a multilingual zero-shot text-to-speech system is used to synthesize speech in multiple languages, improving language diversity. Experimental results demonstrate that fine-tuning the large-scale pretrained model yields competitive performance, while language-adversarial training further enhances robustness. In…
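The language-adversarial idea mentioned in the abstract can be illustrated with a minimal PyTorch sketch of a Gradient Reversal Layer. This is not the authors' implementation: the names `GradReverse`, `grad_reverse`, and `LanguageAdversarialHead`, the plain linear classifiers, the cross-entropy losses, the embedding size, and the value of `lambd` are all assumptions chosen for illustration. The layer is the identity in the forward pass and multiplies gradients by `-lambd` in the backward pass, so the language classifier learns to predict language while the embedding is pushed to discard language cues.

```python
import torch
from torch import nn
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; scales gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; lambd is a plain float, so it gets None.
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class LanguageAdversarialHead(nn.Module):
    """Hypothetical head: a speaker classifier plus a language classifier behind a GRL."""

    def __init__(self, emb_dim, num_speakers, num_languages, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.speaker_clf = nn.Linear(emb_dim, num_speakers)
        self.language_clf = nn.Linear(emb_dim, num_languages)

    def forward(self, emb):
        spk_logits = self.speaker_clf(emb)
        # Gradients flowing back from the language loss are reversed before reaching emb.
        lang_logits = self.language_clf(grad_reverse(emb, self.lambd))
        return spk_logits, lang_logits


torch.manual_seed(0)
emb = torch.randn(4, 16, requires_grad=True)  # stand-in for backbone speaker embeddings
head = LanguageAdversarialHead(16, num_speakers=10, num_languages=3, lambd=0.5)
spk_logits, lang_logits = head(emb)

spk_labels = torch.tensor([0, 1, 2, 3])   # made-up speaker labels
lang_labels = torch.tensor([0, 1, 2, 0])  # made-up language labels
loss = (nn.functional.cross_entropy(spk_logits, spk_labels)
        + nn.functional.cross_entropy(lang_logits, lang_labels))
loss.backward()
```

In a full system the random `emb` tensor would be replaced by the output of the speaker encoder, and `lambd` is often ramped up over training, although the abstract does not say how the authors schedule it.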
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
