TL;DR
This paper introduces the group-adapted fusion network (GFN) to improve fairness in speaker verification, reducing performance disparities across demographic groups and lowering error rates for underrepresented groups.
Contribution
The paper proposes the group-adapted fusion network (GFN), a modular architecture that mitigates demographic bias in speaker verification models and improves both overall and group-specific accuracy.
Findings
Achieves up to 29% relative reduction in overall equal error rate (EER).
Reduces minority group EER by up to 18.6%.
Lowers EER disparity by 20-25%.
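To make the metrics in these findings concrete, here is a minimal sketch of how per-group EER, EER disparity, and relative reduction could be computed from verification trial scores. It is not the paper's evaluation code: the function names, the toy scores, the group labels "m" and "f", and the definition of disparity as the absolute gap between two group EERs are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: the error rate at the threshold where the false
    acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = target (same-speaker) trial
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor trials wrongly accepted
        frr = np.mean(~accept[labels])   # target trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)

def per_group_eer(scores, labels, groups):
    """EER computed separately on the trials of each demographic group."""
    scores, labels, groups = map(np.asarray, (scores, labels, groups))
    return {
        str(g): equal_error_rate(scores[groups == g], labels[groups == g])
        for g in np.unique(groups)
    }

def relative_reduction(baseline, improved):
    """Relative improvement, e.g. 0.29 for a 29% relative EER reduction."""
    return (baseline - improved) / baseline

# Toy trials (hypothetical): group "m" is perfectly separable, group "f" is not.
scores = [0.9, 0.8, 0.1, 0.2, 0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

group_eer = per_group_eer(scores, labels, groups)
# Assumed disparity measure: absolute gap between the two group EERs.
disparity = abs(group_eer["f"] - group_eer["m"])
```

Note that a relative reduction is measured against the baseline value: under this definition, a drop in EER from 10% to 7.1% counts as a 29% relative reduction, not a 2.9-point one.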
Abstract
Modern speaker verification models use deep neural networks to encode utterance audio into discriminative embedding vectors. During the training process, these networks are typically optimized to differentiate arbitrary speakers. This learning process biases the learning of fine voice characteristics towards dominant demographic groups, which can lead to an unfair performance disparity across different groups. This is observed especially with underrepresented demographic groups sharing similar voice characteristics. In this work, we investigate the fairness of speaker verification models on controlled datasets with imbalanced gender distributions, providing direct evidence that model performance suffers for underrepresented groups. To mitigate this disparity we propose the group-adapted fusion network (GFN) architecture, a modular architecture based on group embedding adaptation and…
