Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay
Arash Shahmansoori, Utz Roedig

TL;DR
This paper presents a novel speaker recognition system for consent management in voice assistants, capable of dynamic registration, removal, and re-registration of speakers using contrastive embedding replay, improving efficiency and performance.
Contribution
It introduces a contrastive training approach with embedding replay buffers for dynamic speaker management, addressing key challenges in consent-based voice recognition systems.
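One way to picture the bookkeeping behind embedding replay buffers for dynamic speaker management is the sketch below. It is illustrative only and is not the authors' implementation. The class name `EmbeddingReplayBuffer`, the per-speaker `capacity` with oldest-first eviction, and the assignment of speakers to buckets of `bucket_size` in registration order are all hypothetical assumptions. The embeddings are assumed to come from a separately trained contrastive speaker encoder.

```python
import random


class EmbeddingReplayBuffer:
    """Hypothetical replay buffer of per-speaker embeddings grouped into buckets.

    Assumptions (not from the paper): each speaker keeps at most `capacity`
    embeddings (oldest dropped first), and active speakers are split into
    consecutive buckets of `bucket_size` in registration order.
    """

    def __init__(self, capacity=8, bucket_size=4, seed=0):
        self.capacity = capacity
        self.bucket_size = bucket_size
        self.store = {}        # speaker_id -> list of stored embeddings
        self.order = []        # registration order of active speakers
        self.rng = random.Random(seed)

    def register(self, speaker_id, embeddings):
        # (Re-)registration overwrites any features stored for this speaker,
        # keeping only the most recent `capacity` embeddings.
        if speaker_id not in self.store:
            self.order.append(speaker_id)
        self.store[speaker_id] = list(embeddings)[-self.capacity:]

    def remove(self, speaker_id):
        # Removal drops the speaker's stored features entirely.
        self.store.pop(speaker_id, None)
        if speaker_id in self.order:
            self.order.remove(speaker_id)

    def buckets(self):
        # Split active speakers into consecutive buckets of `bucket_size`.
        return [self.order[i:i + self.bucket_size]
                for i in range(0, len(self.order), self.bucket_size)]

    def replay(self, bucket_index, per_speaker=2):
        # Sample labelled (speaker_id, embedding) pairs from one bucket,
        # e.g. to mix old speakers' features into a training step.
        batch = []
        for spk in self.buckets()[bucket_index]:
            feats = self.store[spk]
            k = min(per_speaker, len(feats))
            batch.extend((spk, f) for f in self.rng.sample(feats, k))
        return batch


buf = EmbeddingReplayBuffer(capacity=3, bucket_size=2)
buf.register("alice", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
buf.register("bob", [[1.0, 0.0]])
buf.register("carol", [[0.0, 1.0]])
print(buf.buckets())   # [['alice', 'bob'], ['carol']]
buf.remove("bob")
print(buf.buckets())   # [['alice', 'carol']]
```

In this sketch, removing a speaker frees their slot so that later buckets shift forward. Re-registering the same speaker appends them as a new entry at the end of the registration order.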
Findings
Outperforms existing methods in accuracy and efficiency.
Demonstrates effective dynamic registration and removal of speakers.
Ensures memory-efficient operation with robust speaker recognition.
Abstract
Voice assistants overhear conversations, so a consent management mechanism is required. Consent management can be implemented using speaker recognition: users who do not give consent enrol their voice, and all their further recordings are discarded. Building speaker recognition-based consent management is challenging because dynamic registration, removal, and re-registration of speakers must be handled efficiently. This work proposes a consent management system that addresses these challenges. Contrastive training is applied to learn the underlying speaker equivariance inductive bias. The contrastive features for buckets of speakers are trained for a few steps within each iteration and act as replay buffers. These features are progressively selected for classification using a multi-strided random sampler. Moreover, new methods for dynamic registration using a portion of old…
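The abstract describes a consent mechanism in which non-consenting users enrol their voice and their later recordings are discarded. The decision step can be sketched as a similarity gate over speaker embeddings. This is a minimal illustration under stated assumptions, not the paper's classifier. The names `ConsentFilter`, `enrol`, `remove`, and `keep` are hypothetical. The mean-embedding profile and the cosine-similarity `threshold` are also assumptions. The embeddings themselves are assumed to come from a trained speaker encoder.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors (0.0 if either is zero).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ConsentFilter:
    """Hypothetical consent gate: drop recordings of enrolled non-consenting speakers.

    Assumption: each enrolled speaker is represented by the mean of their
    enrolment embeddings, and a recording matches a speaker when the cosine
    similarity reaches `threshold`.
    """

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.profiles = {}  # speaker_id -> mean enrolment embedding

    def enrol(self, speaker_id, embeddings):
        # Registration (or re-registration) of a non-consenting speaker.
        dim = len(embeddings[0])
        n = len(embeddings)
        self.profiles[speaker_id] = [sum(e[i] for e in embeddings) / n
                                     for i in range(dim)]

    def remove(self, speaker_id):
        # Removal: the speaker's profile is forgotten.
        self.profiles.pop(speaker_id, None)

    def keep(self, embedding):
        # Keep a recording only if it matches no enrolled speaker.
        return all(cosine(embedding, p) < self.threshold
                   for p in self.profiles.values())


gate = ConsentFilter(threshold=0.8)
gate.enrol("dave", [[1.0, 0.0], [0.9, 0.1]])
print(gate.keep([0.95, 0.05]))  # False: matches the enrolled speaker, discard
print(gate.keep([0.0, 1.0]))    # True: no match, keep
```

A fixed threshold is the simplest choice for this sketch. A real system would need to calibrate it against false-accept and false-reject rates.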
Taxonomy
Topics: Speech Recognition and Synthesis · Interpreting and Communication in Healthcare
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning
