Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay

Arash Shahmansoori; Utz Roedig

arXiv:2205.08459·cs.SD·October 28, 2024

Open Access

TL;DR

This paper presents a novel speaker recognition system for consent management in voice assistants, capable of dynamic registration, removal, and re-registration of speakers using contrastive embedding replay, improving efficiency and performance.

Contribution

It introduces a contrastive training approach with embedding replay buffers for dynamic speaker management, addressing key challenges in consent-based voice recognition systems.

Findings

1. Outperforms existing methods in accuracy and efficiency.
2. Demonstrates effective dynamic registration and removal of speakers.
3. Ensures memory-efficient operation with robust speaker recognition.

Abstract

Voice assistants overhear conversations, so a consent management mechanism is required. Consent management can be implemented using speaker recognition: users who do not give consent enrol their voice, and all their further recordings are discarded. Building speaker-recognition-based consent management is challenging because dynamic registration, removal, and re-registration of speakers must be handled efficiently. This work proposes a consent management system that addresses these challenges. Contrastive training is applied to learn the underlying speaker equivariance inductive bias. The contrastive features for buckets of speakers are trained for a few steps in each iteration and act as replay buffers. These features are progressively selected for classification using a multi-strided random sampler. Moreover, new methods for dynamic registration using a portion of old…
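To make the consent workflow in the abstract concrete, here is a minimal sketch of the enrol, remove, and re-enrol cycle. It is not the paper's method: it swaps in plain cosine matching against per-speaker mean embeddings instead of contrastive training with replay buffers. The class `ConsentRegistry`, its method names, the threshold value, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConsentRegistry:
    """Hypothetical registry of speakers who have NOT given consent."""

    def __init__(self, threshold: float = 0.8) -> None:
        # Assumed decision threshold; a real system would tune this.
        self.threshold = threshold
        self._enrolled: Dict[str, List[float]] = {}

    def register(self, speaker_id: str, embeddings: Sequence[Sequence[float]]) -> None:
        """Enrol (or re-enrol) a speaker as the mean of their embeddings."""
        n, dim = len(embeddings), len(embeddings[0])
        self._enrolled[speaker_id] = [
            sum(e[i] for e in embeddings) / n for i in range(dim)
        ]

    def remove(self, speaker_id: str) -> None:
        """Remove a speaker; a no-op if the speaker is not enrolled."""
        self._enrolled.pop(speaker_id, None)

    def match(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best enrolled speaker at or above the threshold, else None."""
        best_id, best_score = None, self.threshold
        for speaker_id, centroid in self._enrolled.items():
            score = cosine(embedding, centroid)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return best_id

    def should_discard(self, embedding: Sequence[float]) -> bool:
        """A recording is discarded when it matches a non-consenting speaker."""
        return self.match(embedding) is not None
```

In this sketch, re-registration is simply another call to `register` with the same identifier. The paper's contribution concerns doing these updates efficiently inside a trained recognizer, which this toy registry does not attempt to model.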

Taxonomy

Topics: Speech Recognition and Synthesis · Interpreting and Communication in Healthcare

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Learning