Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets
Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke

TL;DR
This paper introduces a household-adapted nonlinear mapping of speaker embeddings that improves speaker identification on shared devices, reducing equal error rate (EER) by 45-71% in simulated households and by 49.2% on real-world internal data.
Contribution
The paper proposes a novel household-adapted nonlinear mapping to enhance speaker embeddings for better discrimination among household members sharing devices.
Findings
Equal error rate (EER) reduced by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers
EER reduced by 49.2% on real-world internal data
Household-adapted embeddings form more compact clusters
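To make the EER figures above concrete, here is a minimal sketch of how an equal error rate and a relative reduction can be computed from verification scores. It assumes the reported percentages are relative reductions against a baseline; the function names and the simple threshold sweep are illustrative choices, not the paper's evaluation code.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores for trials where the speaker matches the profile.
    impostor_scores: scores for trials where the speaker does not match.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a matching trial scores below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-matching trial scores at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Fraction by which new_eer improves on baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer
```

With this definition, a system whose EER falls from a made-up 10% to 5% has a relative reduction of 0.5, that is, 50%.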
Abstract
Speaker identification typically involves three stages. First, a front-end speaker embedding model is trained to embed utterances and speaker profiles. Second, a scoring function is applied between a runtime utterance and each speaker profile. Finally, the speaker is identified by nearest-neighbor search according to the scoring metric. To better distinguish speakers sharing a device within the same household, we propose a household-adapted nonlinear mapping to a low-dimensional space to complement the global scoring metric. The combined scoring function is optimized on labeled or pseudo-labeled speaker utterances. With input dropout, the proposed scoring model reduces EER by 45-71% in simulated households with 2 to 7 hard-to-discriminate speakers per household. On real-world internal data, the EER reduction is 49.2%. From t-SNE visualization, we also show that clusters formed by…
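The scoring and nearest-neighbor stages described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: the abstract does not give the form of the household mapping or of the combination, so the single tanh layer (`adapt`), the weighted sum, and the `alpha` parameter are all hypothetical stand-ins, and the embeddings are assumed to come from a pretrained front-end model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adapt(x, weights, bias):
    """Hypothetical household mapping: one tanh layer to a lower dimension."""
    return [math.tanh(sum(w * a for w, a in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def combined_score(utt, profile, weights, bias, alpha=0.5):
    """Global cosine score plus a score in the household-adapted space.

    The weighted sum and alpha are assumptions, not the paper's formula.
    """
    global_score = cosine(utt, profile)
    local_score = cosine(adapt(utt, weights, bias),
                         adapt(profile, weights, bias))
    return (1 - alpha) * global_score + alpha * local_score


def identify(utt, profiles, weights, bias, alpha=0.5):
    """Nearest neighbor: return the enrolled speaker with the highest score."""
    return max(profiles,
               key=lambda name: combined_score(utt, profiles[name],
                                               weights, bias, alpha))


# Toy household with two similar speakers (made-up 3-dimensional embeddings).
profiles = {"alice": [1.0, 0.0, 0.2], "bob": [0.9, 0.1, -0.2]}
weights = [[0.0, 0.0, 5.0], [0.0, 1.0, 0.0]]  # hypothetical learned weights
bias = [0.0, 0.0]
speaker = identify([1.0, 0.05, 0.15], profiles, weights, bias)
```

In this toy household the mapping amplifies the one dimension on which the two speakers differ, which is the intuition behind adapting the score to the small set of speakers who actually share a device.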
