LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention

Aditya Srinivas Menon; Raj Prakash Gohil; Kumud Tripathi; Pankaj Wasnik

arXiv:2506.02083 · cs.SD · June 4, 2025

Open Access

TL;DR

This paper introduces LASPA, a novel language-agnostic speaker disentanglement method using prefix-tuned cross-attention, which improves multi-lingual speaker recognition by effectively separating linguistic and speaker information.

Contribution

The paper presents a new disentanglement learning strategy with prefix-tuned cross-attention that enhances speaker recognition across multiple languages, including unseen ones.
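As a rough illustration of the mechanism named above, here is a minimal sketch of single-head cross-attention with prefix vectors. It assumes the common prefix-tuning formulation, in which trainable prefix vectors are prepended to the keys and values of an attention layer. This summary does not describe the paper's actual architecture, so the function name `prefix_cross_attention`, the single-head layout, and the plain-list representation are all hypothetical and should not be read as the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_cross_attention(queries, keys, values, prefix_keys, prefix_values):
    """Scaled dot-product cross-attention with prefix vectors (hypothetical helper).

    queries:       list of query vectors (e.g. from one stream of features)
    keys, values:  lists of vectors from the stream being attended to
    prefix_keys, prefix_values: extra vectors prepended to keys/values; in
        prefix tuning these would be the trainable parameters.
    """
    all_keys = prefix_keys + keys
    all_values = prefix_values + values
    scale = math.sqrt(len(queries[0]))
    out_dim = len(all_values[0])

    outputs = []
    for q in queries:
        # Similarity of this query to every prefix and regular key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in all_keys]
        weights = softmax(scores)
        # Weighted sum of prefix and regular values.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, all_values)) for d in range(out_dim)]
        )
    return outputs
```

With an empty prefix the function reduces to ordinary cross-attention. Adding prefix vectors lets a small set of parameters steer what each query attends to without changing the rest of the layer.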

Findings

01 Improves equal error rate across multiple datasets

02 Effectively separates language from speaker embeddings

03 Generalizes well to unseen languages

Abstract

Speaker recognition models face challenges in multi-lingual settings due to the entanglement of linguistic information within speaker embeddings. The overlap between vocal traits such as accent, vocal anatomy, and a language's phonetic structure complicates separating linguistic and speaker information. Disentangling these components can significantly improve speaker recognition accuracy. To this end, we propose a novel disentanglement learning strategy that integrates joint learning through prefix-tuned cross-attention. This approach is particularly effective when speakers switch between languages. Experimental results show the model generalizes across monolingual and multi-lingual settings, including unseen languages. Notably, the proposed model improves the equal error rate across multiple datasets, highlighting its ability to separate language information from speaker embeddings and…
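The equal error rate (EER) mentioned in the abstract is the operating point at which the false-acceptance rate and the false-rejection rate of a verification system are equal; lower is better. The sketch below shows one simple way to estimate it from trial scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores in the usage note are assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER from verification trial scores (hypothetical helper).

    target_scores:    scores for same-speaker trials (higher = more similar)
    nontarget_scores: scores for different-speaker trials
    Returns (eer, threshold) at the candidate threshold where the
    false-acceptance and false-rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is >= the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

For example, with target scores `[0.9, 0.8, 0.7, 0.3]` and non-target scores `[0.6, 0.4, 0.2, 0.1]`, the function returns an EER of 0.25 at threshold 0.6: one of four same-speaker trials is rejected and one of four different-speaker trials is accepted.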

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques