Small footprint Text-Independent Speaker Verification for Embedded Systems
Julien Balian, Raffaele Tavarone, Mathieu Poumeyrol, Alice Coucke

TL;DR
This paper introduces a highly compact, efficient speaker verification model suitable for embedded systems, achieving competitive accuracy with significantly reduced computational resources and enabling real-time processing on small devices.
Contribution
A novel two-stage neural network architecture that is orders of magnitude smaller than existing solutions while maintaining competitive verification performance, making it suitable for embedded applications.
Findings
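Achieves 3.31% EER on the VoxCeleb1 verification test set with only 237.5K parameters and 11.5 MFLOPS.
Runs on a Raspberry Pi 3B with under 200 ms latency on a 5 s utterance.
On the VOiCES corpus, EER increases by only 2.6 percentage points relative to the best-scoring model of the 2019 VOiCES from a Distance Challenge, despite the much smaller model.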
Abstract
Deep neural network approaches to speaker verification have proven successful, but typical computational requirements of State-Of-The-Art (SOTA) systems make them unsuited for embedded applications. In this work, we present a two-stage model architecture orders of magnitude smaller than common solutions (237.5K learning parameters, 11.5MFLOPS) reaching a competitive result of 3.31% Equal Error Rate (EER) on the well-established VoxCeleb1 verification test set. We demonstrate the possibility of running our solution on small devices typical of IoT systems such as the Raspberry Pi 3B with a latency smaller than 200ms on a 5s long utterance. Additionally, we evaluate our model on the acoustically challenging VOiCES corpus. We report a limited increase in EER of 2.6 percentage points with respect to the best scoring model of the 2019 VOiCES from a Distance Challenge, against a reduction of…
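Since the headline results are reported as Equal Error Rate, a brief illustration of how that metric is typically computed may help. The EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (target trials rejected). The sketch below is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and it assumes a simple threshold sweep over the observed scores with no interpolation between thresholds.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: truthy for target (same-speaker) trials, falsy for impostor trials.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    targets = [s for s, is_target in zip(scores, labels) if is_target]
    impostors = [s for s, is_target in zip(scores, labels) if not is_target]
    if not targets or not impostors:
        raise ValueError("need at least one target and one impostor trial")

    best = None
    for threshold in sorted(set(scores)):
        # False acceptance rate: impostor trials scoring at or above the threshold.
        far = sum(s >= threshold for s in impostors) / len(impostors)
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy trial list (made-up scores, for illustration only).
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

On a real trial list such as the VoxCeleb1 verification test set, the same sweep would run over many thousands of scored trial pairs, and evaluation toolkits usually interpolate between adjacent thresholds rather than picking the closest one.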
