End-to-end streaming model for low-latency speech anonymization

Waris Quamer; Ricardo Gutierrez-Osuna

arXiv:2406.09277 · eess.AS · November 4, 2024

Open Access

TL;DR

This paper introduces a low-latency, streaming speech anonymization model that effectively conceals speaker identity while maintaining speech quality, suitable for real-time applications.

Contribution

The authors develop an end-to-end streaming model for speaker anonymization that operates with significantly reduced latency and computational resources.

Findings

01 The full model achieves a latency of 230 ms with high naturalness.

02 The lite version reduces latency to 66 ms while preserving privacy.

03 Both versions reach state-of-the-art performance in intelligibility and privacy.

Abstract

Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system, a full model that achieves a latency of 230 ms, and a lite version (0.1x in size) that further reduces latency to 66 ms while…
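The data flow described in the abstract (content, speaker, and variance representations feeding a decoder, chunk by chunk) can be illustrated with a toy sketch. This is not the authors' implementation: every function below is a hypothetical placeholder standing in for a trained network, the chunk size and sample rate are illustrative values that do not come from the paper, and the sketch assumes that anonymization at inference is done by conditioning the decoder on a speaker embedding other than the source speaker's, which the abstract does not state explicitly.

```python
import math
from typing import Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive non-overlapping chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - chunk_size + 1, chunk_size):
        yield list(samples[start:start + chunk_size])


def content_encoder(chunk: List[float]) -> List[float]:
    """Placeholder for the lightweight content encoder: keeps only the sign pattern."""
    return [1.0 if s >= 0 else -1.0 for s in chunk]


def variance_encoder(chunk: List[float]) -> List[float]:
    """Placeholder for the variance encoder: RMS energy plus a zero-crossing
    count used here as a very crude stand-in for pitch."""
    energy = math.sqrt(sum(s * s for s in chunk) / len(chunk))
    crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a >= 0) != (b >= 0))
    return [energy, float(crossings)]


def decoder(content: List[float], speaker: List[float], variance: List[float]) -> List[float]:
    """Placeholder decoder: rescales the content by the energy and by a gain
    derived from the speaker embedding (mean of its entries)."""
    energy = variance[0]
    gain = sum(speaker) / len(speaker)
    return [c * energy * gain for c in content]


def anonymize_stream(samples: Sequence[float], target_speaker: List[float],
                     chunk_size: int) -> List[float]:
    """Process the signal one chunk at a time, as a streaming system would.
    The target speaker embedding is fixed up front (assumption, see lead-in)."""
    out: List[float] = []
    for chunk in iter_chunks(samples, chunk_size):
        content = content_encoder(chunk)
        variance = variance_encoder(chunk)
        out.extend(decoder(content, target_speaker, variance))
    return out


def buffering_delay_ms(chunk_size: int, sample_rate_hz: int) -> float:
    """Time needed to collect one chunk of audio before it can be processed."""
    return 1000.0 * chunk_size / sample_rate_hz


if __name__ == "__main__":
    signal = [0.5, -0.5] * 9            # 18 samples of a toy square wave
    result = anonymize_stream(signal, target_speaker=[1.0, 1.0], chunk_size=4)
    print(len(result))                  # 16: the trailing 2 samples are dropped
    print(buffering_delay_ms(320, 16000))  # 20.0 (illustrative values only)
```

The point of the sketch is the structure of the loop: each chunk is encoded into separate representations and decoded immediately, so the delay a listener experiences is bounded below by the time spent buffering one chunk plus the per-chunk compute time. The reported 230 ms and 66 ms figures are measurements of the actual models and are not reproduced by this toy code.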

Taxonomy

Topics: Speech Recognition and Synthesis